A Simple Worldview
In RAG (Retrieval-Augmented Generation), the goal is to find the best sources for a given question from a large set of data, such as PDF documents, and pass them to the LLM for processing. Empirically, RAG has been shown to ground the LLM better: it hallucinates less, stays closer to reality, and appears to focus only on the data provided.
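In code terms, the basic flow looks roughly like this; the retriever and the LLM call are placeholders for whatever stack is in use, so read it as a sketch of the idea rather than an implementation:

```python
def answer(question: str, retriever, llm, k: int = 5) -> str:
    """Minimal RAG loop: retrieve the best-matching passages for the question,
    then let the LLM generate an answer with those passages in its context."""
    passages = retriever.search(question, top_k=k)   # e.g. vector search over extracted PDF text
    prompt = (
        "Answer the question using only the sources below.\n\n"
        + "\n\n".join(f"Source {i + 1}:\n{p}" for i, p in enumerate(passages))
        + f"\n\nQuestion: {question}"
    )
    return llm.generate(prompt)                      # placeholder call to the model
```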
I naively assumed that the LLM would read through the data I supplied, understand it, and generate an answer from it. It’s an intuitive mental model.
Until a small detail emerged.
The Problem with Formulas
In our RAG project, an AI assistant searches through 33,000 pages of technical literature in PDF format, selecting up to 35 pages to feed into an LLM. These documents contain many mathematical formulas, such as the following:
The LLM outputs formulas in LaTeX format, embedded in Markdown, and the assistant transforms them into MathML, which looks nice in a browser. So far, so good—no reason to challenge my worldview yet.
The LaTeX formula:
In the browser:
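The LaTeX-to-MathML step is a plain conversion. A minimal sketch, assuming a converter such as the latex2mathml package (the article does not name the tool, and the formula below is only a placeholder):

```python
import latex2mathml.converter

# Example LaTeX as it might appear in the LLM's Markdown output; the formula
# itself is a placeholder, not the one from the article.
latex = r"K_0 = \frac{F}{2 \pi r}"
mathml = latex2mathml.converter.convert(latex)
print(mathml)   # <math ...> ... </math>, ready to embed in the page
```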
Here’s the catch: PDF is a format designed for printing and viewing, but it is terrible for machine processing. The formulas are embedded images. A PDF parser [1] only reads text. Let’s look at the actual data extracted from the PDF and passed to the LLM:
Try projecting this bizarre, garbled string back onto the original formula. Like me, you’ll quickly conclude that it’s impossible to reconstruct the formula from this mess. You’ll also notice that the browser displays B₀ instead of K₀, and the sequence of characters makes no sense [2].
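For reference, the extraction step itself is unspectacular; a minimal sketch using pdfminer.six’s high-level API (footnote [1]), with a placeholder file name and page selection:

```python
from pdfminer.high_level import extract_text

# Pull the plain text of a few selected pages; the file name and page numbers
# are placeholders. In the project, the retriever picks up to 35 pages
# out of roughly 33,000.
text = extract_text("manual.pdf", page_numbers=[11, 12, 13])

# Formula regions come back as garbled character sequences like the one above;
# the visual layout and any embedded images are simply not part of the text.
print(text[:500])
```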
To get to the bottom of this, I chipped away at the extracted text piece by piece, and to my surprise, the formula in the answer remained intact for quite some time.
What kind of magic is this?
It’s all an illusion. The LLM doesn’t read through or understand anything. LLMs don’t process text from start to finish like humans do, nor do they remember what they’ve read.
Speed Intro to Transformer Networks
To demystify the magic, we need to understand the mechanics of inference in Transformer networks (Transformers for short), specifically how they use context and the attention mechanism, at least at a simplified level without the math [3].
Step 1: When text (a prompt with or without data) is passed to the LLM, each word[4] is encoded into a mathematical vector. This vector has hundreds of dimensions[5] and represents the general meaning of the word, as learned by the LLM.
The set of all vectors (representing words) is called the context. Words have different meanings depending on the context in which they appear[6]. This contextual meaning is calculated in the next step.
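To make Step 1 concrete, here is a toy sketch in Python with made-up numbers; a real LLM uses a learned embedding table with tens of thousands of tokens and hundreds of dimensions per vector:

```python
import numpy as np

# Tiny stand-in vocabulary and embedding table. A real LLM has tens of
# thousands of tokens and hundreds of dimensions per vector; here: 5 and 8.
vocab = {"the": 0, "seat": 1, "of": 2, "a": 3, "bearing": 4}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 8))

# Step 1: each word of the prompt becomes a vector; together they form the context.
prompt = ["the", "seat", "of", "a", "bearing"]
context = embeddings[[vocab[w] for w in prompt]]
print(context.shape)  # (5, 8): five words, eight dimensions each
```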
Step 2: The attention mechanism applies the context to itself (self-attention) to adjust the meaning of all words. Words “tug” at each other, influencing each other’s meanings. This is achieved using mathematical methods that operate on matrices and vectors. The matrices represent the trained knowledge[7].
The result is a modified context—all the vectors have shifted, and the words have a new meaning.
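Step 2 can be sketched just as minimally: a single attention head with random stand-in matrices, no causal mask, nothing learned. Only the shape of the computation matters here:

```python
import numpy as np

rng = np.random.default_rng(0)
context = rng.normal(size=(5, 8))   # stand-in for the embedded prompt from Step 1

# The trained knowledge: projection matrices whose values were learned during training.
W_q = rng.normal(size=(8, 8))
W_k = rng.normal(size=(8, 8))
W_v = rng.normal(size=(8, 8))
Q, K, V = context @ W_q, context @ W_k, context @ W_v

# How strongly each word "tugs" at every other word (each row sums to 1).
scores = Q @ K.T / np.sqrt(8)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# The modified context: every vector is now a weighted mix of all the others,
# i.e. every word's meaning has shifted under the influence of its neighbours.
context = weights @ V
print(context.shape)   # still (5, 8), but the vectors have moved
```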
Step 3: The LLM calculates exactly one word (as a vector) that best (most likely) fits at the end of the context. The context grows by one word.
The process then repeats from Step 2 and continues until an end marker becomes the most likely next word, at which point the output is complete.
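And Step 3, the generation loop, again as a toy with a made-up stand-in for the full network. The point is only the shape of the loop: score every vocabulary entry, append the most likely one, repeat until an end marker wins:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<end>", "the", "seat", "of", "a", "bearing", "fits", "tightly"]
embeddings = rng.normal(size=(len(vocab), 8))

def transform(context):
    """Stand-in for the attention layers: returns the adjusted vector at the end of the context."""
    weights = np.exp(context @ context.T)
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ context)[-1]

context = embeddings[[1, 2, 3, 4, 5]]      # "the seat of a bearing"
output = []
for _ in range(10):                        # safety cap for the toy example
    # Step 3: which vocabulary vector fits best at the end of the current context?
    next_id = int(np.argmax(embeddings @ transform(context)))
    if vocab[next_id] == "<end>":          # "the end becomes most likely"
        break
    output.append(vocab[next_id])
    context = np.vstack([context, embeddings[next_id]])   # the context grows by one word

print(" ".join(output))
```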
Ah-Ha!
From the mechanics of Transformers, the following becomes clear:
- An LLM can only produce words and patterns that it was trained on. What isn’t in its matrices cannot appear in its output.
- The LLM does not reproduce our PDF data—it uses its trained data.
- Our data serves as context and influences the attention mechanism and therefore the calculation of the next word. The data is not read, memorized, understood, or even learned!
- The term in-context learning can be misleading because the LLM doesn’t “learn” in the traditional sense during inference; it merely adapts to the given context.
So what about the formulas? The LLM already knew them! Our data merely triggered a pattern that existed in its training data.
We cannot teach an LLM something new by providing data in the context. We can only stimulate the LLM to reproduce learned patterns [8] that are relevant to the context. That’s the magic of RAG.
Next time, when crafting a complex prompt, consider which words might be the most effective in the context. What “tugs” harder? Small changes can make a big difference.
The LLM Always Has the Last Word
An LLM operates strictly inside-out. With RAG, we cannot add anything to the LLM[9]. Through external “stimuli” in the context, the LLM is nudged to reconstruct relevant patterns. This feels grounded because the result aligns neatly with our data.
Thanks to vast amounts of training data, a large LLM can handle almost any data found within an organization. The illusion that it reads our data is practically perfect.
The response of a RAG-based assistant is fundamentally constructed by the LLM. The context may nudge the LLM in one direction or another, but it can never provide the structure of the answer[10]. That structure comes from the LLM. The LLM “selects” data that fits its patterns[11].
They 👽 Are Like Us
Interestingly, our own worldview is also largely built from the inside out. Without our learned patterns, we couldn’t understand anything in the world. External stimuli trigger patterns from memory. For example, this is how the illusion of sharp vision is created, even though only a small part of the retina actually sees sharply. Vision is mostly reconstruction.
You’re probably familiar with the effect where you don’t notice obvious mistakes in a freshly written text. Someone else spots them immediately, and so do you after a break. What you think you’re reading isn’t always what’s written. Your brain reproduces the intended text, which differs in small details from the actual one.
The big difference between us and LLMs, of course, is that we continuously learn. I’m curious to see how the development of LLMs will evolve in this regard.
At that point, even LLMs might find their worldview shaken when they encounter a small detail.
[1] We use pdfminer.six.

[2] I couldn’t figure out why this strange encoding is in the PDF, but that’s irrelevant to the topic.

[3] If you want to dive fully into the mathematical fundamentals, it’s better to start gently with ChatGPT before moving on to Wikipedia.

[4] Actually, the pieces are tokens, smaller parts of words, but this doesn’t change the overall picture.

[5] It consists of hundreds of numbers. A three-dimensional vector consists of three numbers and represents spatial (x, y, z) coordinates.

[6] Think of “seat.” In the context of furniture, it means something entirely different than in machinery (the seat of a bearing). We could continue with “bearing” just as well…

[7] These are the weights of the neural network.

[8] “Pattern” refers to small structures (words) as well as larger contexts.

[9] An LLM can only be changed through fine-tuning.

[10] The data consists of loosely thrown-together, unrelated pages. No structure can be recognized unless you already know it beforehand!

[11] You cannot really assign active and passive roles. Context and the LLM form a unit for a limited time while the LLM is computing.