Dieser Blogpost ist auch auf Deutsch verfügbar
This is my presentation on Retrieval-Augmented Generation from the JUG.ch meetup in Bern on January 28, 2025. I’ve transcribed and edited it as a Talk+ for later reference. The original talk was given in German; you’re reading a translation.
Have you noticed that for the past two years everyone’s been saying “AI” instead of “KI”? I sometimes have this crazy theory: It’s because previously, we always used “KI” when talking about machine learning and everything we weren’t actually doing – because it was too complicated, because we didn’t have the expertise in-house, because it wasn’t powerful enough.
Since that ChatGPT moment, everyone says “AI” because people always need terms to distinguish things and categorize them. Maybe it’s just human that everyone now says “AI” when they mean the “powerful” AI or the generative kind, right? It’s just a theory I have.
Thomas already introduced me. There’s not much to say about me: I work at INNOQ Germany, where I handle the Data & AI business, and Christian, who’s on the board, is with INNOQ Switzerland. I once had the honor of working on a project for a Swiss client for two years. I was in Zurich very often and got to know and appreciate Switzerland. That’s why I’m experiencing my first scandal by Swiss standards today: We’re starting almost 8 minutes late, which is almost at German levels. I also traveled a lot by train to and from Zurich Airport, and once the train to the airport at Zurich Main Station announced a 3-minute delay – and there was such a murmur on the platform, everyone was really displeased. That doesn’t happen in Germany. People only start getting upset after 60 minutes.
I have a small podcast called “AI und jetzt” (“AI and now”) that I do with a colleague. It’s not very technical. It’s less about software architecture and development, and more about what AI means – not just for society. We talk with a variety of people: We’ve had a designer, a behavioral mathematician who’s very deep into the topic, including at the development level. We’ve had a tech journalist from the FAZ. We talk with them about how they use it in their daily tasks, what they do with it, and how they assess it. If you’re interested, check it out.
Why I’m here today: Take a look at this, it’s a screenshot from ChatGPT with the latest model. When I was building this slide, I was looking for a relatable example that software architects could identify with. My example is: I’m a software architect, new to a company. In this case, it’s a high-end wine online shop, the Snobby Wine Connoisseurs GmbH, and my task is to understand the architecture. How and why is the architecture for this e-commerce shop the way it is? Who were the decision-makers? How did it come about? I want to understand this because my task might be modernization, or something else – a typical architectural task.
I threw this question into ChatGPT, although I knew exactly what should come out. ChatGPT can’t know this – unless, by coincidence, I hit the jackpot with the company name and this company was well-known and had a data leak two years ago that ended up in the AI training data. I was glad I could screenshot what I saw, because I never managed to reproduce it. It was like a rip in the space-time continuum. Actually, these latest, largest foundation models are so good that they hallucinate significantly less than smaller models. And here it hallucinated in the most brutal way.
We don’t want that, we never want that – at least when it’s not the assignment. If I say “make up a story about how the software architecture came to be,” because I might need synthetic documentation, then that’s great, but that’s not what I ordered. In the end, maybe it was my prompt. I tried it again and again over several days, and it never did it again.
It actually always did what came next: “Sorry, I can’t help you, I don’t have any info about that, but we can discuss software architecture in the wine industry.” That’s what we expect, that’s what we want – at least when we don’t give it the job to make things up.
With other models, here Claude 3.5 Sonnet, it’s exactly the same: They offer help but also tell you where they can’t help. And that’s how we want it. This is reliable AI. That’s what we want when we use AI in a company. These models are gigantic; people often talk about “world knowledge” in quotation marks, but it’s not a knowledge base where I can open drawers and pull out information, exact data on when something happened. These are gigantic neural networks, and if my internal company data didn’t go into the training, deliberately or not, I can’t get answers about it.
So what’s the problem if we’re now a software architect at Snobby Wine Connoisseurs? We have a technology here where many smart people say we’ll only experience it once in our generation. It’s a General Purpose Technology, to be put on the same level as the steam engine, the internet, electricity, some say. Because I can use it for everything. The boundary is jagged: There are use cases that don’t work so well today, but work in principle. There are others where it works excellently, and we all have to figure out in this sea of use cases how this boundary runs.
No one explains this to us. Above all, we as technicians and IT people don’t explain this to society, though we’ve always had the job of explaining technology to society, explaining digitization, applying it. We’re now on a level with everyone else. It doesn’t work anymore, we can’t explain it anymore. We have to look at each use case, we have to look at our own use cases, suddenly we’re also exposed.
Now we want to use these Large Language Models or Foundation Models for our job here. We want to use them for our architectural work.
How do I get the company data in there? How do I make it known to the model?
There’s a solution and it’s totally simple: Just put it in the prompt.
That’s it. That’s what I need to do.
You’ve all used ChatGPT or maybe local models. You know a bit about how they work. The prompt sets the context for the task I’m giving the model. So we could actually say it was simple – are we done here? I always advise that when we want to build an architecture, develop software, or build a feature that should connect an AI with internal knowledge, with verified internal information, we take the simplest solution first. That’s generally a good idea anyway, because I learn so much along the way – maybe the simplest solution isn’t the best, but then at least there’s no ivory tower to tear down afterwards.
If I search the internet for what I need for RAG, I first get frameworks and tools thrown at me. You need to install this and that – LangChain, this plugin and that plugin – then you need a vector database, then you need an ingestion pipeline to get your data into the vector database. It’s a whole new set of technologies for many of us, and that simply leads to us often not dealing with it for lack of time.
What I’m saying is: For RAG, we actually need almost nothing at first. We need the core technology, the Large Language Model, and we need to know how to get our knowledge in there. But that’s the big hammer that we somehow have to architect around. We need to get the data into the prompt, if it fits in there.
It would all have been so simple if we weren’t faced with this in everyday life. For example, our Confluence, where all our architecture documentation is, should now please go into the model. Copy-paste is tedious. And even if I do it, I quickly realize: Oh oh, it’s getting full, it just doesn’t fit anymore, it’s too much.
What do we do then? Then most people in their PoCs in the company stand just like this and give up. That’s actually the point where we start with an architectural solution, which can be RAG.
That’s why we first need to say “hello” to the context window and get to know it a little better.
What is the context window anyway? The context window is basically like – as an analogy – the human brain. Large Language Models don’t work exactly like that, but it’s a good analogy. It’s like the short-term memory: Everything we put on the intern’s desk – “do this job, here’s our info about it.” That’s the context window, and when the desk is full, it’s full.
The leading models – this slide becomes outdated every two weeks, I can’t even keep up with updating it. Llama 3.3 is out now, Gemini 2.0 will probably be finalized this week or next. Funnily enough, a lot changes, but not the context windows. They basically stay the same. Google apparently has a mode in which they can manage very large context windows without “needle in the haystack” problems. There, I can throw in a thick book as context – 2 million tokens – and can also ask targeted questions about information in a single sentence in the middle. There are tests showing it works well; that wasn’t always the case. But no matter what we use – and these are the largest models that exist – context window size is not inherently tied to model size.
We need to deal with the context window. For that, we first need to understand: What are tokens? Tokens determine the size of the context window. Tokens are basically nothing more than numbers. These Large Language Models or Foundation Models, as they’re more commonly called today – a brief digression on why we no longer say Large Language Models when referring to the really big ones: “Frontier Models” is another term, because they no longer just deal with or focus on language. Language is one facet. They’re actually autoregressive models, that’s the technical term.
Everything that fits well into tokens and into the underlying Transformer architecture, they can process. Language is one of them. Video streams are something else, audio streams also fit well into tokens and the Transformer. That’s why we see Suno and such things, where I can participate with music. Or I can connect my construction site livestream to a model and say: “Please write me a list of all the safety issues on this construction site and categorize them by severity.” That has nothing more to do with the Language Model.
When I sit with clients and ask: “Have you thought about some use cases yet?” and they say: “Yes, summarizing texts,” then I always think: Yes, that’s one of the first use cases and of course they’re great at it. But they now have a whole sea of use cases, and they’re scooping one glass of water from the very edge of it.
Brief digression on what tokens are and why these things aren’t always called LLMs anymore: Tokens are basically numbers that represent something in these gigantic neural networks. If I now want to process language or text with such a neural network, I have to chop up sentences and texts. If I were to process whole sentences or stories or paragraphs – you can all imagine, that’s not how we humans do it either. That’s somehow very unwieldy, it’s inflexible and insanely inefficient.
So we have to find a cutting size that’s a good trade-off – whole words, or maybe pieces a bit smaller than that. The example here is in English: the questions I asked at the beginning, which you can explore with Simon Willison’s GPT Tokenizer. It displays the tokens graphically, and I think it uses a standard tokenizer to create them. Many are whole words, but “GmbH,” for example, gets totally chopped up. “Snobby” gets chopped up. Quotation marks are also a separate token.
I think you can see quite well why that is. Because a Large Language Model, to stick with text, actually has to abstract language. And that’s also something great that we’ve never had before: We now have an abstraction of knowledge. A language-independent one. How many centuries and millennia have we been packing our knowledge in a language-bound way on papyrus rolls, wiki pages, Confluences? It’s always in one language or several, but I have to manifest it in a language. We now have a technology in which this is language-independent, because these tokens are language-independent.
That’s the only reason why these models work so well when I say: “Answer this question for me in Chinese.” Or I ask in Chinese and want German or say: “Invent a dialect between Chinese and German.” So they need to have a modular, easy-to-handle, and as efficient as possible way to form things for us – and transformers and tokens are just what’s currently best for that.
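If you want to see this chopping-up yourself outside the browser tool, a minimal sketch with OpenAI’s open-source tiktoken library looks roughly like this – the example string is ours, and cl100k_base is simply the encoding used by GPT-4-era models:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era OpenAI models

text = "Snobby Wine Connoisseurs GmbH"
token_ids = enc.encode(text)

print(token_ids)                             # a list of plain integers - tokens are "just numbers"
print([enc.decode([t]) for t in token_ids])  # common words stay whole, rare words like "GmbH" get chopped up
```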
Back to our constraint: Our prompt with all our knowledge – and every other message, if we’re talking about a chat; we don’t always build chats, maybe we’re building AI-supported features where there’s no chat at all – must fit into the context window.
If it doesn’t fit, then I have to chop up the corpus – I have to chunk it.
The so-called chunks are a relatively important concept in the RAG architecture. You all know what a chunk is. When you search on Google for “great restaurants Berlin,” you get ten results and those are chunks. That’s a headline with a bit of text, and the text is usually kind of an excerpt where Google has already marked what will probably give me the most help in the overview, when I’m just scanning, to click on this link or choose another. Chunks are nothing more than text snippets that need to be as relevant as possible – and relevant is the big magic word here.
So we need to chop up our corpus where all our architecture documentation is. Let’s take Confluence, it’s a common example. Confluence pages with meeting minutes, ADRs maybe, arc42 documentation, maybe another data source comes in, namely the source code or something. So I have to cut up my whole corpus that’s relevant for my feature or my product – and how do I approach that?
I have to eat the elephant one bite at a time, so I need to start somewhere. I always recommend starting with gut feeling. I need to look at the data; we can’t get around that. We need to know our data. We don’t have to look at everything, but we need to develop a feeling for how it’s roughly structured. I’m not even talking about bringing it all into the same structure – with classical machine learning I’d have to do that. With generative AI and Large Language Models, it’s actually nice that I’m not so dependent on structure.
But I have to chop up this corpus, then I have to decide: Do I take sentences, do I take paragraphs, do I take chapters, do I take pages? I have to start with something in order to iterate over the problem. If you’re cutting these chunks – you might decide on paragraphs – then start and then you have to see: How well does it work now when I give these chunks to the LLM, how good are the answers? If it still doesn’t fit in the context or too much nonsense around it is cut out or too little, then you have to adjust the chunk size. That’s actually what you iterate over most of the time when building a RAG system.
If I’ve now decided to take sentences or paragraphs, for example, then I have to paste the chunks into my prompt.
Copy-paste. The simplest RAG – you’ve set up a RAG architecture in 5 minutes. It can also be manual, it can be me. I can paste the text snippets into my prompt, then I am the RAG architecture.
Don’t get distracted by any XML text. You can structure it in the prompt however you want. You can just send it with a blank line in between. You just have to tell the LLM: “Look, here come the chunks now.” Usually they figure it out anyway, but the intern analogy is always good: here’s the stack of documents, solve this task and look in there. If we do it mechanically – because none of us will be sitting there pasting in these chunks for every query – then we can just use XML. There are models that are happy about it, the Anthropic models in particular. But in the end, they don’t really care either.
At this point, the question often comes up: “Wait, is RAG really just that?” Yes, it’s exactly just that. It’s an architecture or a technique that pastes stuff into my prompt. So back to Google: You may have noticed, Google also does RAG on a very, very large scale. We probably won’t do that, depending on what we’re working on. But so far it’s only been rolled out in the USA, these AI overviews.
That can sometimes be problematic. It’s probably also about chunk sizes. If the first ten search results on this question are nonsense in their relevance, or if the chunk size cuts off the important part – you can do it, but it’s unhealthy – then this AI feature, depending on how it’s prompted, can generate nonsense. But even there: That’s not the last lever. If I write in the prompt “please use your common sense and don’t tell people nonsense,” that kind of stuff shouldn’t happen either. No one knows what they’ve been doing there, but these are real screenshots and these are problems that happen.
So let’s say “hello” to this architecture and take a look at it. RAG is basically nothing more than a technique to ground LLM answers on verified external information – “grounding” is a popular word that has caught on. I need to support my thesis, my statement.
The technique is actually relatively old, at least by AI standards. It originated in 2020 in the paper by Lewis et al.: “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Hardly anyone was talking about Large Language Models in today’s sense yet.
And today it’s actually used for that: It gets our stuff, whatever is in our Confluence, whatever is somewhere. That spices up our prompt, and the LLM is actually just at the back end and generates an answer based on this. That’s RAG.
If we look at this from an architectural perspective – like a helicopter flight, 20,000 m high (do helicopters even fly that high?) – such a high-level bird’s eye view, then it basically looks like this: At the beginning is my prompt, that goes into a retrieval system. Retrieval system is not a new term, we’ve been using it in computer science forever. Search functions, for example, are retrieval systems. And here it’s often about search functions with RAG.
The retrieval step gives our entire prompt, or a part of it, to a search – a vector search or a full-text search. That returns things, and then it goes into augmentation. The results of the retrieval system are inserted into the prompt, and at the end sits the black hole, the LLM. We actually have no control there anymore, we can’t modify anything; we’re kind of at its mercy. Once our prompt is in there, that’s it – it never comes back out, only an answer comes out.
If we fly a bit lower, then we see: In the retrieval step, the complexity of this architecture seems to be hidden. The other steps are relatively trivial, not much happens there anymore. So we actually need to look more closely at what happens here.
In the example, we’d now have two search functions. Why two? I’ll get to that later. Let’s first look at the first one. Who has heard of vector search? You hear about it a lot right now, especially in connection with RAG. If you look on the internet for how to do RAG, 90% of tutorials and articles say: “Well, look, install a vector database, then embed all your stuff and it has to go into the vector database.” That’s only half the truth. I don’t always need a vector database, I need it for a specific use case.
So let’s look at that first. For that, we need to digress a bit. Who knows what vectors and embedding or embeddings are? So I need to embed my chunks, my text. Why do I need to do that? These neural networks, as large as they are, work with the semantic similarity of concepts of things. And I need to somehow teach that to the machine.
Now as an example: The leading AI labs don’t release any more data on how big the neural networks are. With GPT-3, that’s the last data point you could still get from OpenAI – there were 12,288 dimensions. On my slide, that’s totally simplified as a three-dimensional space. You can imagine that super well, but a space with 13,000 dimensions is a bit hard to imagine in your head if we want to understand embeddings now.
Let’s stay in this – fifth grade, sixth grade – three-dimensional geometry model. I could now have things: the yellow lemon, the yellow bird, the cat down there, and then at the very bottom the bird with the white body and the red head. These are concepts in the neural network, the world model. There’s so much more in there – and my own things may not be in there at all, but I want to somehow bring them into context.
So in the vector search I need to know how similar or how dissimilar things are. A so-called embedding is basically such a vector; the yellow lemon is a concept. Now there are different methodologies for mathematically mapping the similarity between things – and cosine similarity is the best known, so that’s the only one I’ll go into. There are two or three others, but cosine similarity is the most common, the one you actually always see, and it’s absolutely fitting for most use cases. The others are optimized variants for slightly different cases.
Let’s just look at that: Cosine similarity is basically nothing more than the measured angle between two vectors, between two concepts, two meanings. If the angle is smaller, then they are more similar than if the angle is large. So the yellow lemon is relatively similar to the yellow bird, the green wings, and the green fruits – so a conglomerate of concepts, they’re not individual things. At least more similar than the yellow lemon to the cat down there. That’s why the angle between them is bigger, or between the yellow lemon and the bird at the very bottom, it’s relatively large.
With cosine similarity, we now go and measure the angles and say: 180° is virtually the complete opposite, then they point in two different directions. The blue car and the yellow lemon. In real life, it’s not that simple, nothing is completely opposite. Everything is somehow a bit similar. And that’s perfect for this, because almost nothing is completely opposite and almost nothing is identical, not even twins. That’s why we see, when we convert our chunks, our text snippets into an embedding, i.e., a vector, almost never any identical or exactly opposite things.
We take the angles and say: 180° would be a -1, 90° (or 270°) is neutral, a 0, and identical would be a 1. What you see in real life are floating-point numbers that fall somewhere on this scale.
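As a tiny illustration of that scale, here is the cosine similarity calculation in code. The three-dimensional vectors are invented stand-ins for the lemon, the bird, and the car from the slide – real embeddings have thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cosine of the angle between two vectors: -1 opposite, 0 unrelated, 1 identical
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

lemon = np.array([0.9, 0.8, 0.1])   # toy 3-dimensional "embeddings", made up for illustration
bird  = np.array([0.8, 0.9, 0.2])
car   = np.array([-0.7, 0.1, 0.9])

print(cosine_similarity(lemon, bird))  # ~0.99: very similar concepts
print(cosine_similarity(lemon, car))   # ~-0.33: dissimilar concepts
```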
What does it look like when I run our text snippets through this? Once more for understanding: An embedding is nothing more than a vector that describes semantics, meaning.
It helped me to just do it. I always give this tip to everyone: If you have a Postgres, for example – most of us do – even that, I don’t need more, any relational database. I could go and introduce a column in my table of blog posts or insurance contracts or whatever that I call Embedding. I can store an array of floating point numbers in there.
That means I could take a row from my table – if it’s appropriate, for a blog post it would be the title, the subtitle, the author, the date, and the text – if I concatenate those and run them through an embedding model, something like this comes out. And the array is exactly as long as the embedding model specifies the size. They’re always the same length, but the values in there are always different.
That’s actually an embedding. And if I have a Postgres, I can install the pgvector extension and then I can just try it out. Then I could embed my blog posts or whatever I have – fruit types in my database – with such an embedding model. It takes a bit, it’s relatively computationally intensive. You don’t just do this in passing, and you don’t do it in the request-response cycle either – it doesn’t take forever, but it is a long-runner.
Then I look at it, save it in there, and then I can use SQL – I won’t show this in the talk, but you can find it relatively quickly – there’s also an operator for the query with which I compare vectors. I put something in, for example a query: “Find me all blog posts that deal with Java architectures that were en vogue 10 years ago, before Spring Boot existed.” That could be an insanely fuzzy search query. If I throw that into my SQL query and say “please look at the embeddings column,” then I get results back. Those are probably relatively relevant blog posts that match this enormously inflated query, and that’s exactly what vector databases enable for us.
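Since the query isn’t on the slides, here is a hedged sketch of what it can look like with pgvector and psycopg. The table and column names are invented, `<=>` is pgvector’s cosine-distance operator, and `embed()` stands in for whatever embedding model you use – it must be the same model that filled the column:

```python
import psycopg  # psycopg 3

def embed(text: str) -> list[float]:
    raise NotImplementedError("plug in your embedding model here (same one used at indexing time)")

query = "Java architectures that were en vogue before Spring Boot existed"
vec = "[" + ",".join(str(x) for x in embed(query)) + "]"  # pgvector's literal format

with psycopg.connect("dbname=blog") as conn:
    rows = conn.execute(
        """
        SELECT title, 1 - (embedding <=> %s::vector) AS cosine_similarity
        FROM blog_posts
        ORDER BY embedding <=> %s::vector  -- cosine distance: ascending order = most similar first
        LIMIT 5
        """,
        (vec, vec),
    ).fetchall()

for title, score in rows:
    print(f"{score:.3f}  {title}")
```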
I’ll show again, simplified, how it works: In these many dimensions in this vector space, there’s a floating point number and it represents one aspect of the total meaning. That means our blog posts or our chunked, our embedded chunks are mapped like this. But we can’t visualize it well, because 2D and 3D are still doable, after that it gets difficult.
The vector search just takes this example query – the insanely inflated, fuzzy one, where I might not even be able to decide what I’m actually searching for, but I just put it in like that, I might have dictated it while walking. The vector search searches and returns results to me, in our RAG case, the chunked Confluence content, and finds chunks. Each of these chunks or each meaning has this cosine similarity as a floating point number. It’s extremely simplified here: That one is far away, it has a certain distance to my query. Then there are other chunks that go closer to 1, closer to identical. They’re closer to what I asked. Then there are others that are around 0.8 and so on.
That’s how you have to imagine this space: In the middle is my question and around it are my chunks, and they are differently relevant to this question. The cool thing about vector search is that I can formulate insanely complex questions. I can speak freely, I can ask in Chinese, I could also mix in some Dutch or something. It doesn’t matter because it gets embedded.
So a few characteristics of vector search that you need to know: Because nothing is somehow opposite, there are actually never empty result sets. You never see empty queries, empty result sets with vector search. They always return something. And there can always be relatively many results – so if the search is extremely specific, it doesn’t mean I only get one back, I might still get six, seven back. I just have to look at the scores then.
They’re rated by similarity and not by exact match, as we’re used to from full-text search. As I said, it works across languages, and the cool thing is: If I can’t think of a term and I describe something, so relatively human – what was my query again, you still have it in your head: “en vogue frameworks in Java before Spring Boot.” That’s all not so insanely precise, in the end I might be missing a year or something. That works. I could also say “give me places to eat” if I can’t think of the word “restaurant.” You can all see that a full-text search would actually already drop out there. The fuzziness is actually the great power of vector search.
It depends heavily on the embedding quality what we get back. Now I could say, in my RAG system someone enters this question from the beginning: “How did our software architecture come about? Who were the decision-makers?” Now I could say: Okay, the similarities, I need to get them up, then my results will be more relevant. That’s such a well-known misconception. If I increase the cosine similarity, then I have more relevant results.
I can achieve this by inflating the query, by rewriting it. For example, I could put an LLM in front and say: “Look, someone just asked this question, enrich it, inflate it a bit, what does he probably mean? What else might he want?” Then suddenly the probabilities increase. But in the real world, the results aren’t more relevant. They’re just mathematically more similar.
This is a popular technique in RAG systems, rewriting or supplementing the query. But I need to have context that is relevant. If I don’t have that, it will be invented and then we’ll end up in the nonsense cloud again. If I have a user profile, for example – Google Maps, Google Places API, they know very well what Robert has been searching for in restaurants for the last 20 years. So I would be surprised if they didn’t know that. They have a user profile, and a relatively extensive one about me.
If I now type into a new product at Google Maps – which currently doesn’t exist, at least not for me – “I want something like this here” – instead of “the best restaurants in Berlin Mitte” – we, who are developing the system, could decide: okay, Robert is asking for the best restaurants in Berlin Mitte, but we actually know quite a bit about him. We know, for example, that he really likes classic traditional Italian food and always enjoys great restaurant interiors, maybe with turn-of-the-century architecture, and he’s not such a fan of stinky natural wine – in Berlin, that’s a relatively important criterion when looking for restaurants.
If I know that: Do it, inflate the query! People are lazy writers. They’ve been conditioned over decades to type things like “Best Restaurants Berlin Mitte” into search boxes. “Best” – you can’t find a better example than “best,” because “best” is totally context-dependent. For you, “best” is something completely different than for me. We’re relatively similar, I think – we found that out in the conversation beforehand – but in this case: if I have the data, I should inflate the query. That’s a great mechanism to get better results and not rely purely on chunk relevance.
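A hedged sketch of that query inflation: the profile fields, the prompt wording, and `call_llm` are all invented for illustration – the important part is that only verified user data goes in, nothing gets guessed:

```python
def rewrite_query(raw_query: str, user_profile: dict, call_llm) -> str:
    prompt = (
        "Rewrite the following search query to be more specific, using only the "
        "verified facts about the user below. Do not invent preferences.\n\n"
        f"Query: {raw_query}\n"
        f"Verified user profile: {user_profile}\n\n"
        "Return only the rewritten query."
    )
    return call_llm(prompt)  # call_llm is whatever LLM client you already have

# rewrite_query("Best Restaurants Berlin Mitte",
#               {"cuisine": "classic traditional Italian", "dislikes": "natural wine"},
#               call_llm=my_llm_client)
```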
A well-known misconception is that when you build RAG, you always have to use a vector search. That’s nonsense. What reality actually shows and what we implement for clients is almost always a hybrid search, where actually two retrieval systems are used: A vector search and a full-text search, because they both have different strengths and weaknesses. That’s why we actually recommend that.
But it depends on the feature. If we stick with Google Maps – people will still type “Best Italian Restaurant” there for a few years, they’re just conditioned to it. But if we put a magic wand next to it and say “This is a cool AI search, here are some examples of how you could also search,” then the first users will start to formulate their queries a bit more specifically.
I can serve both use cases with a hybrid search. If I don’t have both use cases, it’s still sometimes cool to have both. I would even say, in the majority of cases, because vector search is extremely good with these fuzzy, “vibe-based” queries and full-text search is extremely bad. But full-text search is extremely good with specific queries, for example “ADR 25.04.2024 Software Architecture” or something like that. The vector search will give me six, seven, ten results and the full-text search, if well configured, one – and that’s what I want. I don’t want to pick that out from the vector search results.
If I now have both, and they work completely differently, these are different worlds, I somehow have to unite the results.
Then you use so-called rank fusion, a relatively simple mathematical algorithm. I sort by rank position and not by the scores, because I can’t compare the scores: the score in the vector search is the floating-point number, and the score in the full-text search might be an integer that just keeps growing. I can’t compare them like that, that’s nonsense. So I actually have to take the rank – the order, the logical position – and unite the lists based on that.
It looks like this: I now have the full-text search, which has delivered two chunks to me, A and B. The vector search four: A and B are also included, but A is not in first place as in the full-text search and B is at the bottom. But it also has C and D. Could be good stuff in there that the full-text search might not have found. I then have to unite them through rank fusion.
Then you simply merge based on the order in each list, and I end up with my four results. In reality, you’d solve this algorithmically.
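One common way to solve it algorithmically is Reciprocal Rank Fusion (RRF) – a minimal sketch with the chunk letters from the example above; k=60 is the usual constant from the RRF literature:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists by position only; the raw scores are not comparable."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

full_text = ["A", "B"]             # full-text search: two hits
vector    = ["C", "A", "D", "B"]   # vector search: four hits, different order

print(reciprocal_rank_fusion([full_text, vector]))  # ['A', 'B', 'C', 'D']
```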
With that, we’re actually through with the complex retrieval part of the RAG architecture. We have to learn quite a bit that’s new, but we can also really apply our knowledge of FTS systems, full-text search. That’s nothing new for any of us; we have an enormous amount of expertise in it. And that’s good, because it raises the quality of such a RAG system.
Where are the limits, the boundaries of such a retrieval system? I can’t expect that in a RAG system I can let users say: “Give me all orders from the last calendar week.” That’s an aggregation. That won’t work with the retrieval as we’ve discussed it, not reliably. I can be lucky. I also can’t say “group something for me” or so. That’s meant for something else.
We’ve even built something like that before. Then you go and put an LLM right at the front, where a RAG architecture is actually at the back, and say: “What does he want now? Determine the intent.” Is he asking for an aggregation, then please build an SQL query and just query it and give it back to him. Or is he trying to find some cool restaurants in Berlin and can’t express himself properly? Then please go to our retrieval. At the front, like a router, you have an AI that basically decides what is probably wanted there. For the user, it’s not transparent, they get results for their question in the chat. But we have to make that dependent on the use case.
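A rough sketch of that router idea – the intent labels, the prompt, and the two downstream functions are placeholders for illustration, not a framework API:

```python
def route(question: str, call_llm, answer_with_sql, answer_with_rag) -> str:
    intent = call_llm(
        "Classify this request as AGGREGATION (counts, sums, group-bys over "
        "structured data) or RETRIEVAL (finding relevant documents). "
        f"Answer with exactly one word.\n\nRequest: {question}"
    ).strip().upper()
    # aggregations go to a SQL-generating path, everything else to the RAG retrieval
    return answer_with_sql(question) if intent == "AGGREGATION" else answer_with_rag(question)
```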
We’re now leaving the retrieval and going into the augmentation and generation phase. Now we have our chunks, the three that came back – before there were four in the example, here we only have three simplified. They’re sorted. What does the prompt look like that goes into the machine?
This is actually real – parts of the prompt come from a production system for a client. It simply says – up at the top is the system prompt, for example, which wouldn’t normally be in the feature prompt, but here it’s all lumped together for simplicity: “You are an AI assistant and you help software architects who are working on an e-commerce shop for a high-end wine retailer in their daily work. Your role is to provide accurate, relevant, and helpful answers to their questions with the context you receive.”
You really have to pre-chew something like that – intern analogy. “You receive context in the form of ranked chunks from a retrieval system. This context contains relevant information to answer the architect’s question. Here follows the context.” Ignore XML. You can format it however you want. You just need to say: “Here comes the context now,” so nothing gets mixed up.
This is what it can look like, doesn’t have to. If I have URLs, I put URLs in there. It’s nice, then I can directly link the answers in the response. If I have a system dealing with PDFs and pages that lie somewhere on file systems, I obviously can’t link directly, then I have to do it somehow differently. I just give the user: “You can find it in book ABC on page 13 chapter 4.” Here is the Confluence example, there I can take URIs. These are the ranked chunks.
“The software architect has asked the following question,” then paste in the question. “To answer this question, proceed step by step. First analyze the context. Then identify the relevant information to answer this question. Formulate a comprehensive and accurate answer.”
And now it gets really interesting: “Make sure that every statement in the answer you formulate is covered by at least one source, and the sources must be the chunks. And attach behind every statement a numerical reference to the source that supports your statement.” You’re basically teaching it journalistic work through this prompt. “Don’t give any info that’s not verifiable based on the sources.”
“Important!” – then you shout at it again. “Please note” – you see all of this in the production system. “Please, please, please remember to formulate your answer in Markdown.” You can find this in the Anthropic system prompt too; it literally says “please, please, please” in English. “Use references for the sources throughout in the given format, always put the reference behind the statement where the source is used, every paragraph must contain at least one reference, and every statement must contain a reference.”
This is what the prompt looks like. We put that into the generation, then we get an answer. I’ll spare you the examples now of what something like this looks like. You know it too: If you say in ChatGPT – you can now turn on a search function, click the globe, usually it recognizes your intent and throws on the globe itself – “Give me cool restaurants,” then it uses such a search index. That’s also RAG and makes us these little dots behind its answer, which I can then click on. Those are the sources.
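Mechanically, the augmentation step that assembles this prompt is just string building. Here is a condensed sketch – the tag names and the wording are a paraphrase of the prompt above, not the client’s production prompt verbatim:

```python
def build_prompt(question: str, ranked_chunks: list[dict]) -> str:
    context = "\n".join(
        f'<chunk rank="{i}" source="{c["url"]}">\n{c["text"]}\n</chunk>'
        for i, c in enumerate(ranked_chunks, start=1)
    )
    return (
        "You are an AI assistant helping software architects working on an "
        "e-commerce shop for a high-end wine retailer. Answer only from the context below.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"The architect asked: {question}\n\n"
        "Proceed step by step, make sure every statement is covered by at least one chunk, "
        "attach a numerical reference [n] behind each statement, and format your answer in Markdown."
    )
```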
What have we learned in customer projects?
Chunking is the hard problem. The chunks need to be relevant, the chunks need to contain the info that counts as evidence for an answer. That means I have to cut the chunks correctly, otherwise my retrieval system is not good, and if the retrieval system is not good, everything behind it is actually lost.
If the chunks are too small, context is lost. I gave an example earlier: then the justification is missing, it’s been cut away, it’s in another chunk three chunks later. And if that chunk can’t be found by the search because the preceding sentence is missing, then it won’t appear. If the chunks are too large, the retrieval also fails, because then I’m looking for a needle but always delivering a whole barn – and ten or fifteen barns at that.
Information distribution is difficult. If the core facts are distributed across too many chunks, they’re hard to combine.
Query formulation: What does the user want versus what do my data actually provide? If I put my Confluence in there and say “answer questions about software architecture,” and there’s not a single ADR in it and an arc42 documentation from 1950, then I can build the best architecture, it just won’t work if the data is garbage.
A relatively new solution for these problems comes from the Anthropic researchers – the people who also develop the Claude models and who still do really cool basic research and share it. They developed Contextual Retrieval; that’s from November, it’s not that old. The idea: use an LLM to formulate the chunks coherently.
When I chunk my Confluence, things are missing at the front and back. That will always be the case. If I form whole sentences, then I have a sentence that is semantically complete at first, but if the following sentence actually belongs to it to be truly relevant, and it’s in the next chunk, I have a problem. And I can only approach this problem approximately to solve it. I’ll never find the perfect chunk size.
They go and say: when you’ve cut your chunks, you run an LLM over all of them, one that iteratively works through the entire corpus and reformulates every chunk so it’s coherent on its own. Through that I get a bit of information duplication, maybe, but each chunk is basically self-contained. I basically inflate my chunks like the query earlier. They published results on this, and it supposedly yields significantly higher result quality if the prerequisites are right and if it’s difficult to find a uniform chunk size for my data. I have to find one; I have to decide on the best approximation, but it might not be perfect. If it’s not, Contextual Retrieval is the solution.
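A hedged sketch of that rewriting step – `call_llm` is a placeholder for whatever client you use, and the prompt wording is ours, not Anthropic’s published prompt verbatim:

```python
def contextualize_chunk(document: str, chunk: str, call_llm) -> str:
    prompt = (
        "Here is a full document:\n"
        f"{document}\n\n"
        "Here is one chunk from that document:\n"
        f"{chunk}\n\n"
        "Write one or two sentences that situate this chunk within the document "
        "so it can be understood on its own. Return only that context."
    )
    context = call_llm(prompt)
    return f"{context}\n\n{chunk}"  # this coherent, slightly inflated chunk is what gets embedded and indexed
```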
What are the alternatives to RAG?
You’ve probably heard of fine-tuning, right? That’s basically taking the finished models and giving them another training step at the very end. I could now think: Okay, why don’t I train my Confluence into my Llama 3.3 70B? Then it’s in there, why do I have to put this whole retrieval stuff in front, someone has to maintain it again, the two searches, they log again, a box has to be set up for it. I could just fine-tune. I do it once and then the stuff is in there.
And that’s exactly the problem: I don’t just do that on the side. It doesn’t take forever either, but it’s relatively expensive, because I obviously need GPUs to attach such a training step to the end of a model. I probably won’t use the very smallest models, if I use open-weights models at all. OpenAI and Anthropic both also offer fine-tuning for their closed models, so you can do all of that with them too, but it’s long-running and it costs a bit.
And then I basically have a knowledge cutoff again. That means, if a new ADR lands in Confluence, someone has to start the fine-tuning again. In the retrieval system, it’s available as soon as it lands in the index – not quite real-time, but with much shorter delays. You have to weigh that up. If the Confluence only changes once in a blue moon and we’re talking about the architecture of a mainframe that no one can touch anymore, fine-tuning could be an option.
But fine-tuning also means I’m influencing the model weights, and what you often see is: if you fine-tune nonsense into the model, it also forgets proper language and starts to babble strangely, things like that. So I actually need to know a bit about what I’m doing there, and I have to make sure the data is really good.
Yes, and it costs a bit – although OpenAI in particular seems enormously convinced that this is good for many customers, which is why they give huge discounts on fine-tuning, making it really cost-effective. They want people to try it, because too few people simply do it. Because people always get statements like these thrown at them: “Oh, it’s too expensive, then the knowledge cutoff is back, it takes too long, and the thing babbles nonsense afterwards.” That can happen, but it doesn’t have to, and it’s also relatively well addressed by OpenAI. You have to decide – it’s a trade-off.
An example where fine-tuning is brilliant is, for example, image fine-tuning. If I deploy a foundation model that is supposed to create marketing flyers in the brand language in the corporate design of my company and is supposed to vary them and always be on brand – white space has to be right, color choices, fonts and everything – then image fine-tuning is brilliant, I can’t achieve that through prompt engineering. Fine-tuning is great for that.
Large law firms also fine-tune their models with legal texts and lawyer-speak. People always say that’s a classic fine-tuning case – I’m not so sure it’s that simple, because purely through prompt engineering I can actually already teach a model what a legal text should look like.
With RAG, it’s just storage costs. I have to keep these indices somewhere, the vector index and the full-text index. It’s relatively easy to update. With embedding, of course, keep in mind: if someone creates a new Confluence page, the embedding isn’t there instantly – it takes a bit, probably a few minutes to half an hour, or you do it asynchronously later – but it’s nothing like fine-tuning.
We’ve all learned, this year the agents are coming. That’s why many talk about agentic RAG. I always include this so that you roughly know what changes there. Because I could say: “Here, dear agent, do a research and give me your result this evening” – such a long-running task – “and decide for yourself how long you search, and just return when you think it’s good.” Those are agentic facets.
No one knows exactly how to define agents; right now everyone is starting to define what an agent actually is. Let’s assume it’s something like that. It could be a good use case. How does the RAG architecture change through an agent? It actually only changes by introducing an outer loop – from the generation step back to retrieval.
The agent enters the query and looks at what comes back – or rather the LLM, actually another LLM, looks at what comes back and evaluates it.
Maybe I need to go back to the query and send it in again; context might be missing, or I might need to use tools. What are tools? Tools are simply when an LLM triggers a web search, calls an API, or books a hotel – that kind of thing is tool usage. The agent autonomously decides to use a tool. If I tell it: “Look, besides the retrieval system, you could also tap into our API with ADRs. Take a look in there,” then that’s tool use. And the agent keeps doing this until it thinks the result is good. Then I knock on the boss’s door: “Here’s my result.” That’s agentic RAG – an outer loop is introduced. In the best case, they don’t buy anything along the way.
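As a sketch of that outer loop – every function here (retrieve, evaluate, rewrite, answer) is a placeholder for illustration; evaluate is the extra LLM call that judges whether the retrieved context is good enough:

```python
def agentic_rag(question: str, retrieve, evaluate, rewrite, answer, max_steps: int = 5) -> str:
    query = question
    chunks: list[str] = []
    for _ in range(max_steps):
        chunks = retrieve(query)
        verdict = evaluate(question, chunks)        # another LLM call: "is this enough to answer?"
        if verdict["sufficient"]:
            break
        query = rewrite(question, chunks, verdict)  # reformulate, or pull in a tool result (web search, ADR API, ...)
    return answer(question, chunks)                 # knock on the boss's door with the best result we have
```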
How do I build a good RAG search? Many people ask that too. And you’ve probably already seen it in the talk: You actually have to build a good search. If the search is garbage, then the RAG architecture doesn’t solve the problem. That’s why you have to build a good search, and how do you know what a good search is?
You all know it, you all suffer like me from Atlassian’s Confluence search. That’s obviously not a good search – I’m going way out on a limb here, and I stand behind it. That thing never finds what I’m looking for, and I enter exact, really exact keywords. It’s so bad that I always have to enter the exact page title to get the result; otherwise it’s at position 20 or so.
A good search is good when it works for the users of the feature or the product. And if it is, it works funnily enough just as well for the silicon users. That’s actually the nice thing. Because this human analogy is particularly nice here: If my user tests show “Wow, people find what they’re looking for,” then I don’t need to worry so much about the RAG architecture and the LLM behind it.
So, that’s it. We still have a little goodie: We’ve released a small booklet that you can download as EPUB and PDF, it’s a quick introduction to the architecture, a bit more than this talk. If you’d like to have it, you can scan the QR code. We don’t track anything, we don’t collect any data. You can just download it. You don’t have to enter anything either. If the slides are shared, you’ll probably find it there again.
I’ll leave the slide up a bit longer because I don’t really have anything else, except to point out a case study. We build various RAG architectures for clients, and this client has allowed us to talk about it: the company Sprengnetter, who do real estate valuation.
If you’re interested in such a real-world case, you can read about it there. The short version: their domain is real estate valuation. And I always use this insanely catchy case: a real estate appraiser or broker who works with them has to evaluate, on site, a new commercial property that is to be built in direct proximity – 500 m – to a nuclear reactor. What is the land worth? What rent can I charge? What is the value of this property?
Then they can ask this assistant. They could also draw on the internal knowledge they produce: they have all their real estate valuation knowledge for brokers in PDF books, which can be licensed for money and provided to brokers. The answers are in there, including mathematical formulas, but if I want to know something on the go, I don’t wade through books – I ask a colleague who knows better than I do. They’re pretty happy with the assistant because they can ask exactly that kind of thing. They can also ask completely different things – I’m not that immersed in the domain – and it delivers the answers back. Because the sources are PDFs, it always says: “You can find it on page 13, chapter 4, in this sentence – and here is the excerpt.” That’s a nice real case where it brings them real added value in everyday life.
Otherwise, I look forward to your questions. Thank you.