
MemPalace: Local-First AI Memory Without the Cloud Bill

AI
Tags: ai-memory, mempalace, local-first, agents, chromadb, knowledge-graph, mcp

Six months of daily AI use adds up to 19.5 million tokens. That is every decision, every debugging session, every architecture debate. Most of it gets thrown away when the context window fills. MemPalace stores all of it.

The Problem

Existing memory systems for AI assistants share a common design: they feed conversation history through an LLM, ask it to extract "what matters", and store the extracted facts. Mem0 does this. Zep does this. The approach sounds sensible. The problem is what gets discarded in the process.

An LLM deciding what to remember will prioritise conclusions over reasoning. It will strip away the back-and-forth, the false starts, the constraints that were discussed and then implicitly adopted. You get a clean summary, but you lose the context that made the summary meaningful. Three months later, when a related question comes up, the stored fact might answer it, but the reasoning behind it is gone.

There is also the cost question. Every extraction pass costs tokens. Every summarisation step burns an API call. Services like Mem0 start at $19 a month, and the token costs stack up quickly for anyone using AI heavily throughout the day. The economics push towards summarising less frequently or storing less detail, which undermines the whole point of having a memory system in the first place.

How MemPalace Works

MemPalace takes a different approach. Instead of extracting facts, it stores the verbatim text of every conversation. No LLM decides what matters. No summarisation pass strips away detail. The full exchange goes in, and the full exchange comes out when you search for it later.

The retrieval layer is semantic search powered by ChromaDB. You query with natural language, ChromaDB returns the most relevant conversation segments, and the requesting model reads the original text directly. The 96.6% LongMemEval score comes from this raw pipeline alone: verbatim text stored as single session documents, retrieved with ChromaDB's default embeddings, no heuristics, no reranking, no LLM at any stage.
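The shape of that pipeline can be sketched in a few lines. This is a toy illustration only: it ranks verbatim session documents with a bag-of-words cosine similarity standing in for ChromaDB's embedding model, and all names and sample data are invented for the sketch.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words vector; MemPalace itself relies on ChromaDB's
    # default sentence-embedding model for this step.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(x * x for x in a.values()))
    nb = sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, sessions, k=5):
    # Rank full session documents against the query: no extraction,
    # no summarisation, just similarity over the original text.
    q = embed(query)
    ranked = sorted(sessions, key=lambda sid: cosine(q, embed(sessions[sid])),
                    reverse=True)
    return ranked[:k]

sessions = {
    "s1": "we debated whether to adopt postgres or sqlite for the cache",
    "s2": "dark mode preference discussion and css variables",
    "s3": "postgres migration plan and rollback strategy",
}
print(search("postgres cache decision", sessions, k=2))  # ['s1', 's3']
```

The point of the sketch is what is absent: there is no LLM call anywhere between storing a conversation and retrieving it.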

On top of this sits the palace architecture. The storage is organised into a spatial metaphor: wings represent people or projects, rooms represent topics within those projects, and drawers hold individual conversation items. When a search is scoped to a specific wing and room, the palace structure provides a 34% boost in retrieval recall over flat search, according to the project's own benchmarks. The backend is pluggable through a simple interface in base.py, so the default ChromaDB setup can be swapped out without rewriting the storage logic.
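A pluggable backend of this kind might look roughly like the following. The method names and signatures here are hypothetical, not the actual base.py interface; the trivial in-memory backend stands in for the default ChromaDB one to show how wing/room scoping narrows the search.

```python
from abc import ABC, abstractmethod
from typing import Optional

class MemoryBackend(ABC):
    """Hypothetical sketch of a pluggable storage interface."""

    @abstractmethod
    def store(self, wing: str, room: str, drawer: str, text: str) -> None: ...

    @abstractmethod
    def search(self, query: str, wing: Optional[str] = None,
               room: Optional[str] = None, k: int = 5) -> list: ...

class InMemoryBackend(MemoryBackend):
    # Trivial stand-in for the default ChromaDB backend.
    def __init__(self):
        self.items = {}  # (wing, room, drawer) -> verbatim text

    def store(self, wing, room, drawer, text):
        self.items[(wing, room, drawer)] = text

    def search(self, query, wing=None, room=None, k=5):
        # Scoping to a wing/room shrinks the candidate set, which is
        # where the recall boost over flat search comes from.
        hits = [text for (w, r, _), text in self.items.items()
                if (wing is None or w == wing)
                and (room is None or r == room)
                and query.lower() in text.lower()]
        return hits[:k]

backend = InMemoryBackend()
backend.store("myapp", "database", "d1", "postgres chosen over sqlite")
backend.store("myapp", "frontend", "d2", "postgres mentioned in css notes")
print(backend.search("postgres", wing="myapp", room="database"))
```

Swapping backends then means implementing two methods, not rewriting the storage logic.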

Benchmarks

LongMemEval is the standard academic benchmark for AI memory systems. It tests whether a system can retrieve the correct conversation session when given a question that requires information from that session. R@5 measures whether the correct session appears anywhere in the top five results.
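The metric itself is simple enough to state in code. A minimal sketch, with invented rankings and gold labels:

```python
def recall_at_k(rankings, gold, k=5):
    # Fraction of questions whose gold session appears in the top-k results.
    hits = sum(1 for ranked, answer in zip(rankings, gold) if answer in ranked[:k])
    return hits / len(gold)

# Three toy questions: the gold session is retrieved for the first two only.
rankings = [["s3", "s1", "s7"], ["s4", "s9", "s2"], ["s8", "s5", "s6"]]
gold = ["s1", "s4", "s2"]
print(round(recall_at_k(rankings, gold, k=5), 3))  # 0.667
```

A 96.6% R@5 over 500 questions therefore means the correct session was in the top five for 483 of them.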

MemPalace's raw mode scores 96.6% R@5 across all 500 LongMemEval questions with zero API calls. This has been independently reproduced on an M2 Ultra in under five minutes. The benchmark runner, dataset, and result files are all committed in the repository. No API key is needed to reproduce it.

The hybrid v4 pipeline adds keyword boosting, temporal proximity scoring, and preference weighting on top of the raw vector search. Its parameters were tuned on a 50-question dev split; on the held-out 450 questions, which were never seen during tuning, hybrid v4 reaches 98.4% R@5. With optional LLM reranking added, the score climbs higher still, but the raw and hybrid-no-LLM numbers are the ones that matter for the local-first pitch.
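One plausible shape for such a hybrid score is a weighted sum over the raw similarity. The weights, the decay constant, and the function name below are all invented for illustration, not the tuned v4 values:

```python
import math

def hybrid_score(vector_sim, query_terms, doc_terms, doc_age_days,
                 preference_boost=0.0, w_kw=0.2, w_time=0.1):
    # Keyword boost: share of query terms appearing verbatim in the document.
    kw = len(set(query_terms) & set(doc_terms)) / max(len(set(query_terms)), 1)
    # Temporal proximity: exponential decay with a 30-day decay constant.
    recency = math.exp(-doc_age_days / 30)
    return vector_sim + w_kw * kw + w_time * recency + preference_boost

# A fresh document matching half the query terms, base similarity 0.80:
print(hybrid_score(0.80, ["postgres", "cache"], ["postgres", "plan"],
                   doc_age_days=0))  # 1.0
```

Because every term is computed locally, the hybrid pipeline keeps the zero-API-call property of raw mode.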

That 96.6% figure is, as of April 2026, the highest published LongMemEval score for any system requiring no API key, no cloud dependency, and no LLM at any stage. Mem0 sits around 85%. The gap comes down to a simple design choice: keeping the original text instead of distilling it.

The Knowledge Graph

MemPalace includes a temporal entity-relationship graph backed by SQLite. Facts are stored as triples (subject, predicate, object) with validity periods, so they can be added, queried, and invalidated over time as circumstances change.

This is separate from the verbatim storage. The knowledge graph handles structured facts: "project X uses framework Y as of March 2026", or "person Z prefers dark mode". When a fact becomes outdated, rather than deleting it, you mark it with an end date. Future queries then retrieve only the current truth without losing the historical record.
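A minimal sketch of such a temporal triple store, using the standard sqlite3 module (the schema and helper names are assumptions for illustration, not MemPalace's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE facts (
        subject TEXT, predicate TEXT, object TEXT,
        valid_from TEXT, valid_to TEXT  -- NULL valid_to means still current
    )
""")

def add_fact(s, p, o, since):
    con.execute("INSERT INTO facts VALUES (?, ?, ?, ?, NULL)", (s, p, o, since))

def invalidate(s, p, until):
    # Mark the fact as outdated instead of deleting it.
    con.execute("UPDATE facts SET valid_to = ? "
                "WHERE subject = ? AND predicate = ? AND valid_to IS NULL",
                (until, s, p))

def current(s, p):
    row = con.execute("SELECT object FROM facts "
                      "WHERE subject = ? AND predicate = ? AND valid_to IS NULL",
                      (s, p)).fetchone()
    return row[0] if row else None

add_fact("project-x", "uses-framework", "framework-y", "2026-03-01")
invalidate("project-x", "uses-framework", "2026-05-01")
add_fact("project-x", "uses-framework", "framework-z", "2026-05-01")
print(current("project-x", "uses-framework"))  # framework-z
```

The superseded triple stays in the table with its end date, so a historical query can still recover what was true in March.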

The graph integrates with the MCP server, so any connected AI agent can add or query facts as part of its normal workflow.

AAAK Compression

MemPalace ships an optional lossy compression format called AAAK (Aggressive Aggressive Abbreviated Knowledge). It combines short entity codes, structural markers, and sentence truncation to achieve roughly 30x compression on conversation text. The key design decision: AAAK-encoded text remains readable by any LLM without a dedicated decoder. A model encountering AAAK for the first time can parse it from context.
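To make the idea concrete, here is a heavily simplified sketch of that style of compression: entity-code substitution plus per-sentence truncation. The actual AAAK format is not documented here, so everything below (function name, codes, limits) is invented for illustration:

```python
def aaak_encode(text, entity_codes, max_words=8):
    # Substitute short entity codes, then truncate each sentence;
    # the result stays readable from context without a decoder.
    for name, code in entity_codes.items():
        text = text.replace(name, code)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return " | ".join(" ".join(s.split()[:max_words]) for s in sentences)

codes = {"MemPalace": "MP", "knowledge graph": "KG"}
print(aaak_encode("MemPalace stores the knowledge graph in SQLite. "
                  "The knowledge graph tracks validity periods for facts.",
                  codes))
```

Both steps are lossy: truncation discards tail words outright, which is exactly why recall drops when retrieval has to run over the compressed form.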

AAAK trades accuracy for storage efficiency. On LongMemEval, AAAK mode drops recall to 84.2% R@5, a 12.4 percentage point regression from the raw 96.6%. The team is upfront about this. AAAK is a compression layer for situations where disk space matters more than perfect retrieval. It is not the default, and the headline benchmark numbers do not use it.

MCP Integration

MemPalace ships 29 MCP tools covering five categories: palace reads and writes, knowledge graph operations, cross-wing navigation, drawer management, and agent diaries. Setting it up with Claude is a single command:

claude mcp add mempalace -- python -m mempalace.mcp_server

After that, Claude automatically calls the relevant tools when it needs to search memory or record new information. It also works with Gemini CLI and any other MCP-compatible client. For local models that do not support MCP, a wake-up command loads roughly 170 tokens of key context, enough for the model to follow the memory protocol without a full MCP integration.

Agent Support

In multi-agent setups, each specialist agent gets its own wing in the palace and its own diary for tracking what it did and why. Agents are discoverable at runtime via mempalace_list_agents, so a coordinator agent can find out what specialist agents exist and query their memory without any prior configuration. This makes MemPalace practical for systems where agents are added or removed dynamically.

Install and Use

pip install mempalace
mempalace init ~/projects/myapp

Requirements are minimal: Python 3.9 or later, ChromaDB, and roughly 300MB of disk space for the default embedding model. No API key is needed for the core feature set. The project is released under the MIT licence.

MemPalace proves that local-first memory, done right, can match or beat cloud services. No subscription, no data leaving your machine, no token burn on extraction passes. Just the conversations you had, stored in full, searchable the moment you need them.