The Retrieval Problem Everyone Ignores
Everyone’s building better memory systems. Almost nobody is asking the harder question: how do you know what to remember right now?
Imagine a library with every book ever written. Floor to ceiling, aisle after aisle. Perfect catalog. Instant search. You can find any book in under a second.
Now imagine you walk in and someone says: “Write a good essay.”
That’s it. No topic and no rules. No sense of what you already know or what gaps you need to fill. Just write a “good” essay, and here’s every book ever written to help you.
You’d drown. Not because the library is bad, but because access to everything is functionally the same as access to nothing when you don’t know what you need.
This is the state of AI agent memory in 2026.
We’ve gotten remarkably good at storage. Vector databases are fast. Embeddings are cheap. RAG pipelines can chunk, index, and retrieve documents at scale. Semantic search actually works now. If you ask “what did we discuss about the deployment last Tuesday,” the system will probably find it.
The storage problem is largely solved. The retrieval problem is wide open.
And it’s not the retrieval problem you think it is1.
Here’s a distinction that matters more than it sounds: searching for information and remembering information are fundamentally different cognitive acts.
When you search, you already know what you’re looking for. You have a query. You type it in. Boom, results come back (and usually some ads). This is exactly what RAG does (minus the ads2), and it does it well.
When you remember, something comes to you. You’re working on a problem, and suddenly a relevant experience surfaces. Nobody queried it and you didn’t provide a search term. Your brain pattern-matched the current situation against stored experience and proactively served up the relevant bit.
The difference matters because most agent memory systems only do the first thing. The agent can search when told to search. It can retrieve when given a query. But it doesn’t remember. It doesn’t proactively surface relevant context based on what’s happening right now.
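One way to picture the gap: search takes a query someone typed; remembering watches the working context and surfaces matches on its own. A minimal sketch of the distinction, in toy Python (the episode store, tag-overlap matching, and function names are all illustrative, not any real library's API):

```python
from dataclasses import dataclass

@dataclass
class Episode:
    summary: str
    tags: set[str]  # e.g. {"deployment", "dns"}

# Hypothetical store of past experiences.
EPISODES = [
    Episode("Deploy failed; root cause was stale DNS records", {"deployment", "dns"}),
    Episode("Deploy failed after a permissions change in config", {"deployment", "permissions", "config"}),
    Episode("Planned the Q3 roadmap", {"planning"}),
]

def search(query_tags: set[str]) -> list[Episode]:
    """Reactive: runs only when someone explicitly asks."""
    return [e for e in EPISODES if query_tags & e.tags]

def remember(context_tags: set[str]) -> list[Episode]:
    """Proactive: called on every context update, not on a query.
    Pattern-matches the *current situation* against stored experience."""
    return [e for e in EPISODES if context_tags & e.tags]

# The agent just started a debugging session; nobody typed a query yet.
for e in remember({"deployment", "debugging"}):
    print(e.summary)
```

Note that the lookup itself is nearly identical in both functions. What differs is when it runs and on what input, which is exactly why this is a behavior problem rather than a mechanism problem.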
A developer sits down to debug a deployment issue. Before they’ve even opened the logs, they’re already thinking about the last three deployment issues they dealt with. The one that turned out to be a DNS thing. The one that was a permissions issue after a config change. Their brain is already loading relevant context, unprompted, because the situation triggered retrieval.
Your agent starts every debugging session from scratch. It has the same information stored somewhere. It just doesn’t know it’s relevant until someone explicitly asks.
Every retrieval decision involves three questions:
1. What to retrieve?
This appears straightforward, but it’s not. The term “relevant information” is circular. Relevant to what? The current message? The current task? The broader project? The user’s emotional state? A frustrated user asking “why doesn’t this work” requires different context than a curious user asking the same words.
2. At what fidelity?
Do you need the full conversation from last Tuesday, or just the conclusion? Do you need the raw API response or the summary? Do you need the entire project history or just the last three decisions? Fidelity has a cost. Every token of retrieved context competes with every other token for the model’s attention. Over-retrieving is almost as bad as under-retrieving because it buries the signal in noise3.
3. When to retrieve?
This is the one nobody talks about. Most systems retrieve at a fixed point: user sends message, system searches memory, results get stuffed into context. But that’s like only checking your rearview mirror when someone honks. The best time to retrieve context is often before the user asks, based on what’s happening in the conversation. The task changed. A new entity appeared. The emotional register shifted. These are all retrieval triggers that get ignored because the system only retrieves on explicit query.
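The trigger idea can be made concrete. Here is a toy sketch of event-driven retrieval timing, where retrieval fires when the conversation shifts rather than on a fixed schedule. The trigger checks are deliberately crude stand-ins (a regex for "new entity," keyword lists for register and task) for what would be real classifiers:

```python
import re

ENTITY_RE = r"\b[A-Z][a-zA-Z0-9_]+\b"

def detect_triggers(msg: str, known_entities: set[str]) -> list[str]:
    """Return reasons to retrieve *now*, instead of on every message."""
    triggers = []
    # A new entity appeared: a capitalized token we haven't seen before.
    if set(re.findall(ENTITY_RE, msg)) - known_entities:
        triggers.append("new_entity")
    # The emotional register shifted (keyword stand-in for a classifier).
    if any(w in msg.lower() for w in ("broken", "why doesn't", "frustrated", "urgent")):
        triggers.append("register_shift")
    # The task changed (imperative-verb stand-in for intent detection).
    if any(msg.lower().startswith(v) for v in ("deploy", "debug", "write", "review")):
        triggers.append("task_change")
    return triggers

# Retrieval fires only when a trigger fires, not once per message.
msgs = ["thanks, that worked", "Debug the Payments service, it's broken again"]
known: set[str] = set()
for msg in msgs:
    hits = detect_triggers(msg, known)
    known |= set(re.findall(ENTITY_RE, msg))
    if hits:
        print(f"retrieve({hits!r}) before responding to: {msg!r}")
```

The first message trips nothing; the second trips all three triggers and would pull in deployment-failure context before the model ever sees the request.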
Every current memory system I’ve looked at answers question 1 with semantic similarity (which is fine for simple cases), ignores question 2 entirely (everything comes back at full fidelity), and answers question 3 with “whenever the user sends a message” (which misses most of the interesting moments).
In the last post, I talked about the student who highlights every line in the textbook4. That metaphor extends here in an uncomfortable way.
The highlighted-everything student has technically performed retrieval. They marked the “important” parts and the “less important” parts alike, because they had no framework for deciding what mattered (or were simply too lazy to apply one).
Now give that student a study guide. “The exam covers chapters 3, 7, and 12. Focus on the relationship between X and Y. Expect one essay question on Z.”
Suddenly they know what to highlight. The study guide didn’t add any new information; it added purpose to the retrieval process. It told them what they’re retrieving for. This is the missing piece in agent memory. Not better search or more storage. Purpose-driven retrieval. The system needs to know what the agent is trying to do before it decides what context to fetch.
Without that, you’re building bigger and faster libraries and wondering why the essays aren’t getting better.
Let’s go another layer deep into where current approaches break down. Semantic search works by converting text to vectors and finding nearby vectors. “How do I deploy to production?” and “production deployment steps” are semantically close, so the system surfaces the right doc. Amazing!
Consider this scenario: you’re debugging a failing deployment, and the actually relevant memory is a conversation from two weeks ago where your teammate mentioned changing the SSL certificate rotation schedule. That conversation wasn’t about deployments; it was about security maintenance. Semantically, it’s distant from your current query, but causally, it’s the answer.
Human memory handles this through associative retrieval. The connection isn’t semantic similarity. It’s causal, temporal, or experiential proximity. “Last time the deploys broke, it was because someone changed something in the security config” is an association built from experience, not from vector distance.
Current RAG systems can’t make that connection because they only know about similarity. They don’t model cause and effect. They don’t track “this thing happened after that thing” or “this problem was caused by that change.” They match words, not experiences.
This isn’t a criticism of RAG; it’s a boundary condition. RAG is a retrieval mechanism, while memory is a retrieval behavior. The mechanism is just one component of the behavior, not the whole thing.
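What would that behavior need underneath it? One direction is to store explicit links between memories — “caused by,” “preceded” — and traverse them at retrieval time. A toy sketch (the memory IDs, link types, and hop-limited traversal are invented for illustration):

```python
from collections import defaultdict

# Memories keyed by id; edges carry a relation, not a similarity score.
MEMORIES = {
    "m1": "Teammate changed the SSL certificate rotation schedule",
    "m2": "Production deploys started failing with TLS handshake errors",
}
LINKS: defaultdict[str, list] = defaultdict(list)  # id -> [(relation, other_id)]
LINKS["m2"].append(("caused_by", "m1"))
LINKS["m1"].append(("preceded", "m2"))

def associative_retrieve(seed_id: str, max_hops: int = 2) -> list[str]:
    """Follow causal/temporal edges outward from a seed memory.
    This reaches memories that are semantically distant from the query."""
    seen, frontier = {seed_id}, [seed_id]
    for _ in range(max_hops):
        nxt = []
        for node in frontier:
            for _, other in LINKS[node]:
                if other not in seen:
                    seen.add(other)
                    nxt.append(other)
        frontier = nxt
    return [MEMORIES[i] for i in sorted(seen)]

# Semantic search over "failing deployment" would find m2;
# the causal edge then surfaces m1, the actual answer.
print(associative_retrieve("m2"))
```

The hard part, of course, is not the traversal but building the edges in the first place — deciding, at write time, what caused what. That is a consolidation problem, not a search problem.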
In emergency medicine, there’s a concept called triage. When patients arrive at the ER, they don’t get treated based on who arrived first. Instead, they’re assessed, categorized by urgency, and routed appropriately. A gunshot wound will be treated before a sprained ankle, regardless of who entered the ER first.
Context retrieval for AI agents needs the same logic. Not all information is equally urgent, and not all of it should be processed simultaneously. Additionally, the priority order changes depending on the current task. What is the metaphorical gunshot wound for this task?5
For example, when an agent is writing code, the project architecture and recent commits are high priority. However, when the agent is having a casual conversation, those priorities flip. Similarly, when the agent is debugging, error context and recent changes become the top priority, while long-term project vision drops to near zero.
Static retrieval (same query, same results, regardless of task context) can’t do this. You need something upstream of retrieval that understands what the agent is doing and routes the retrieval process accordingly.
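That upstream layer is easy to sketch even if it's hard to build well: a small routing table that reweights memory categories by the agent's current task, applied after the raw retriever and before the model. The categories, weights, and budget here are all invented for illustration:

```python
# Per-task priority over memory categories; 0 means "don't even fetch it".
TRIAGE = {
    "coding":    {"architecture": 0.9, "recent_commits": 0.8, "small_talk": 0.0},
    "debugging": {"error_context": 1.0, "recent_changes": 0.9, "project_vision": 0.0},
    "chat":      {"user_preferences": 0.8, "architecture": 0.1, "error_context": 0.1},
}

def triage_retrieve(task: str, candidates: list[tuple[str, str]], budget: int = 2) -> list[str]:
    """Rank (category, text) candidates by task-specific priority,
    drop zero-priority categories, and keep the top `budget`."""
    weights = TRIAGE.get(task, {})
    ranked = sorted(candidates, key=lambda c: weights.get(c[0], 0.5), reverse=True)
    return [text for cat, text in ranked if weights.get(cat, 0.5) > 0][:budget]

hits = [
    ("project_vision", "Five-year plan: become the platform of record"),
    ("error_context", "TLS handshake failures started at 14:02"),
    ("recent_changes", "SSL cert rotation schedule changed yesterday"),
]
print(triage_retrieve("debugging", hits))
```

During debugging, the error context and the recent change survive; the five-year plan gets triaged to zero. Hand the same candidates to a "chat" task and the ordering flips.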
In medicine, triage happens before treatment. In agent systems, retrieval triage should happen before the model sees anything. But almost nobody builds it that way.
I keep ending these posts without giving you the full answer, and I know that’s annoying. But the shape of it should be getting clearer:
The model isn’t the bottleneck. The context is. A frontier model with noisy context will underperform a mid-tier model with clean, purposeful context. We’re spending billions making the model smarter and almost nothing making the context better.
Retrieval is a multi-step decision, not a single query. What, at what fidelity, and when. Most systems handle one of the three. The gap between one and three is where quality lives.
Storage is solved. Selection is the new problem. The race to build better vector stores and bigger knowledge bases is important infrastructure. But the differentiator going forward is what you do with all that stored information. How you select. How you compress. How you time the retrieval.
The next step is the question of compression. Because once you accept that not all context deserves the same fidelity, you need a theory of how to compress different types of information for different purposes.
1. The example of “what did we discuss about the deployment last Tuesday” works because it is specific about a time and a topic. Try asking “what did we discuss about the deployment” when you have more than one.
2. For now. Looking at you, OpenAI.
3. And rips through your token spend.
4. If you didn’t read the first post: “Your Agent’s Context Window Is Not a Junk Drawer.” The thesis is that dumping everything into context and letting the model sort it out is the agent equivalent of highlighting the entire textbook. Go read it, I’ll wait.
5. This is likely a more American example, but everyone has seen enough movies to get the idea. If your life is at risk, you get priority over the person who has mistaken indigestion for appendicitis.



