Solving AI's Limited Memory with a Multi-Layered Caching and Context Strategy
Frustrated with your AI assistant forgetting crucial context? I share my multi-layered caching strategy using MCP servers to create a persistent, smarter AI development partner.
As a developer, I've embraced AI assistants like GitHub Copilot and Claude Code. They're revolutionary, but they have a critical flaw: they're forgetful. An AI's "memory" is limited by its context window. Once that window is full, the AI starts compressing information, and crucial details get lost.
I’d tell my AI not to use a specific component, and five prompts later, it would make the exact same mistake. It was frustrating and inefficient. The core problem is that every interaction, including the massive list of tools (MCP servers) you connect, eats into this precious context window. Just connecting to GitHub's MCP server using the remote URL https://api.githubcopilot.com/mcp/ was consuming nearly 20% of my available context!
This led me down a rabbit hole to build a better "brain" for my AI assistant. My goal was to create a persistent, long-term memory and a more efficient way to provide context, without overwhelming the AI. Here’s a glimpse into the strategy that finally worked.
The Problem: A Leaky Memory and a Crowded Room
Imagine your AI's context window is its short-term memory. You're in a room (the session) with your AI. Now, invite some friends: these are your MCP servers, like GitHub, Figma, etc. Each friend comes with a long list of everything they can do. Before you can even ask your AI a question, you first have to read out the entire list of capabilities from every single friend in the room.
You can see the issue. The more powerful tools you add, the less memory your AI has for your actual conversation and code. It leads to:
- Repetitive Errors: The AI forgets previous instructions and constraints.
- Wasted Time: You spend more time re-explaining things than developing.
- Inefficient Tool Usage: The AI can get confused by the sheer number of available tools.
I knew I needed a system to make the AI smarter, more efficient, and less forgetful. The solution wasn't one single tool, but a multi-layered approach to context management. It started with controlling what tools were even allowed in the room.
Layer 1: The Bouncer - Selective Tool Loading with Docker and Claude Code Hooks
My first step was to act like a bouncer at a club. For example, instead of letting the entire GitHub MCP server in with its massive list of tools, I used its Docker container version (see https://github.com/github/github-mcp-server/blob/main/docs/installation-guides/install-claude.md#local-server-setup-docker-required). This allowed me to configure the server to only expose the tool groups I actually use (e.g., managing repositories and issues) and leave out the ones I rarely touch.
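For reference, this is roughly how that registration looks on my machine, using Claude Code's `claude mcp add` command. The toolset names are the ones I picked for my workflow (the linked installation guide lists the rest), and the exact flags may differ slightly between Claude Code versions:

```bash
# Register the GitHub MCP server as a local Docker container instead of the
# remote URL, exposing only the toolsets I actually use.
# $GITHUB_PAT is a placeholder for wherever you keep your personal access token.
claude mcp add github \
  -e GITHUB_PERSONAL_ACCESS_TOKEN="$GITHUB_PAT" \
  -e GITHUB_TOOLSETS="repos,issues" \
  -- docker run -i --rm \
     -e GITHUB_PERSONAL_ACCESS_TOKEN \
     -e GITHUB_TOOLSETS \
     ghcr.io/github/github-mcp-server
```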
The Result: The context footprint for my GitHub tools dropped significantly.
I took this a step further with project-specific tools. The `Laravel Boost` MCP server, for example, provides documentation for the Laravel framework; there's no reason for it to be active in a Python project. So I implemented a `SessionStart` hook in Claude Code: when a session starts, it checks whether the current project is a Laravel project, and only then registers the Laravel Boost MCP server for that project, if it isn't already in place.
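Here is a stripped-down sketch of that hook script. In my setup it's registered under the `SessionStart` event in `.claude/settings.json`; the Laravel detection and the Boost command below are simplified and assume Boost exposes its MCP server via an artisan command, so adjust them to match your own installation:

```bash
#!/usr/bin/env bash
# SessionStart hook: only wire up Laravel Boost when the project is actually Laravel.
# The boost:mcp command reflects my setup; check your Laravel Boost install docs.

# Detect a Laravel project: composer.json present and laravel/framework required.
if [ -f composer.json ] && grep -q '"laravel/framework"' composer.json; then
  # Register the Boost MCP server for this project only, unless it already exists.
  if ! claude mcp list 2>/dev/null | grep -q "laravel-boost"; then
    claude mcp add laravel-boost --scope project -- php artisan boost:mcp
  fi
fi
```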
This dynamic loading ensures that context is only used when it's relevant, keeping the AI's "mind" clear and focused on the task at hand.
Layer 2: The Codebase GPS - Semantic Search with Claude Context
Before the AI can remember my decisions, it first needs to understand the codebase it's working in. For this, I use another powerful MCP server: `Claude Context`. This tool indexes the entire workspace using a vector database. It's like giving the AI a super-powered "find" command that understands natural language.
The setup requires an embedding model, which I run locally with Ollama to avoid extra costs and to take advantage of the quite capable hardware in my MacBook Pro (M4 Max Apple silicon chip, 128 GB of unified memory), but you can use services like OpenAI's APIs as well. The initial indexing takes time, but the payoff is immense.
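For completeness, this is roughly how I register it. Treat the npm package name and the environment variable names below as assumptions to verify against the Claude Context README; I've also left out the vector-database connection settings (Milvus/Zilliz, in Claude Context's case), which you'll need to add for your own backend:

```bash
# Register Claude Context with a local Ollama embedding model.
# Package and variable names are assumptions; check the project's README.
claude mcp add claude-context \
  -e EMBEDDING_PROVIDER="Ollama" \
  -e EMBEDDING_MODEL="nomic-embed-text" \
  -e OLLAMA_HOST="http://127.0.0.1:11434" \
  -- npx -y @zilliz/claude-context-mcp
```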
Instead of the AI fumbling around with file names, I can say, "Find the function that handles user authentication," and `Claude Context` will pinpoint the exact files and code blocks. This dramatically speeds up any task that requires understanding existing code, making the AI a much more efficient partner.
Layer 3: The External Brain - A Persistent Vector Database with Cipher
Solving the tool-list problem was a good start, but it didn't solve the issue of the AI forgetting conversations and key decisions. For this, I integrated `Cipher`, an MCP server which connects to a vector database. Think of this as giving my AI a searchable, long-term memory. It's been a game-changer.
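Wiring Cipher in is straightforward: it ships as an npm package and can run in MCP mode. The commands below reflect how I set it up, but verify the package name and flags against Cipher's own documentation, as they may have changed:

```bash
# Install Cipher and register it with Claude Code in MCP mode.
# Cipher reads its LLM and embedding settings (Ollama, in my case) from its own
# config file, so nothing model-related needs to be passed here.
npm install -g @byterover/cipher
claude mcp add cipher -- cipher --mode mcp
```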
I configured it to use all three distinct "memory" partitions that Cipher has to offer:
- Knowledge Memory: Stores facts and instructions. (e.g., "The `Flux UI alert` component does not exist. Use the `Flux UI callout` component instead.")
- Reflection Memory: Stores outcomes and learnings. (e.g., "Attempted to use `Flux UI alert`, which caused an error. The solution was to use `Flux UI callout`.")
- Workspace Memory: Stores information about the current project's structure and state.
My AI agent's hooks were updated to automatically query this memory at the beginning of a task and save new learnings at the end. Now, when I start a new session, the AI can retrieve past decisions and avoid repeating mistakes. It's not perfect — I still occasionally remind it to save important information — but it has drastically reduced the "groundhog day" effect of re-teaching the AI.
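One simple way to wire the "query at the start, save at the end" behavior is a `SessionStart` hook whose standard output Claude Code adds to the session context (at least in current versions). Here's a stripped-down sketch of the idea:

```bash
#!/usr/bin/env bash
# SessionStart hook: whatever this script prints is added to the session context,
# so I use it to steer the agent toward Cipher's memory tools.
cat <<'EOF'
Before starting work: search the Cipher memory (knowledge, reflection and
workspace partitions) for decisions and constraints relevant to this task.
After finishing: store any new decisions, errors and their fixes back into
Cipher so they are available in future sessions.
EOF
```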
Layer 4: The Smart Summarizer - Delegating Documentation Lookups
The final piece of the puzzle was handling external documentation. Some MCP servers, like `Context7`, can query any documentation. The problem? They return a massive wall of text, which again floods the context window. If I ask for information on Laravel Livewire, I might get the entire documentation page back, using up a huge chunk of my token budget.
The solution was beautifully simple: I installed another MCP server called `ducks-with-tools`.
This tool acts as an intelligent proxy. Instead of connecting `Context7` directly to my main AI, I connect it to `ducks-with-tools`. Here’s the workflow:
- My main AI agent asks `ducks-with-tools`: "How do I create a form with validation in Laravel Livewire?"
- `ducks-with-tools`, using its own model access (API keys for Claude or GPT, or, as in my case, a local LLM running in Ollama), queries the documentation via `Context7`.
- It receives the huge, token-heavy document.
- Crucially, it then summarizes this document into a concise, actionable answer.
- It returns only this small, summarized response to my main AI agent.
This means my primary agent gets the knowledge it needs without its context window ever seeing the massive source document. It keeps my agent's memory free to focus on writing code, not reading documentation.
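To make the wiring concrete: if Context7 were attached directly to my main agent, it would be registered with something like the command below. In my setup, that entry lives on the `ducks-with-tools` side instead (where exactly depends on that project's configuration), so the raw documentation pages never touch my main agent's context:

```bash
# How Context7 would normally be attached directly to Claude Code.
# I deliberately do NOT do this; it is declared as a downstream server of
# ducks-with-tools instead, so only the proxy ever sees the full documentation.
claude mcp add context7 -- npx -y @upstash/context7-mcp
```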
Final Architecture
My current setup provides a robust, multi-layered memory system for my AI assistant:
- Dynamic Tool Loading: Only relevant tools are loaded, saving context.
- Code-Awareness via Semantic Search: `Claude Context` helps the AI understand the codebase.
- Persistent Vector Memory: `Cipher` ensures key learnings and project context are stored and retrieved across sessions.
- Summarized Knowledge Retrieval: Documentation is queried and condensed by a secondary service, protecting the primary agent's context.
By treating the AI's context window as a valuable, finite resource, I've built a development partner that is not only powerful but also reliable and consistent. It remembers my rules, understands my projects, and learns from its mistakes—making me a more effective developer.