LongMemEval is the academic standard for personal memory systems — 500 questions across six categories drawn from realistic multi-session conversations. We ran RetainDB against the same oracle split used in published research.
LongMemEval — ICLR 2025

| Category | RetainDB | Supermemory | Zep |
|---|---|---|---|
| Single-session user: recall facts the user stated about themselves | 88% | 97% | 93% |
| Single-session preference: surface preferences for personalised response generation | 88% | 70% | 57% |
| Temporal reasoning: reason over dates, durations, and event ordering | 74% | 77% | 62% |
| Knowledge update: retrieve the most-recent value when facts changed over time | 76% | 89% | 83% |
| Multi-session: synthesise facts scattered across many conversations | 68% | 71% | 58% |
| Overall (published / in progress) | 79% | 81.6% | — |
Supermemory and Zep figures are sourced from their published research pages. The remaining RetainDB categories are actively being evaluated.
LongMemEval tests recall across real multi-session conversations. Our pipeline is built for this: extract memories turn-by-turn, write them with provenance, retrieve chronologically.
Every turn (user and assistant) is processed individually by gpt-5.4-mini. A sliding 3-turn context window disambiguates pronouns and vague references without leaking future evidence.
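A minimal sketch of that sliding-window loop. The function names and turn format here are illustrative, not RetainDB's actual API; the point is that each turn is extracted with only *trailing* context, so no future evidence leaks into the extraction prompt.

```python
from collections import deque

WINDOW = 3  # turns of trailing context for pronoun resolution

def extraction_batches(turns):
    """Yield (current_turn, context) pairs: each turn is processed
    individually, with only the preceding WINDOW turns as context,
    so later turns can never leak into an earlier extraction."""
    context = deque(maxlen=WINDOW)
    for turn in turns:
        yield turn, list(context)
        context.append(turn)

turns = [
    {"role": "user", "content": "I adopted a dog last week."},
    {"role": "assistant", "content": "Congrats! What's its name?"},
    {"role": "user", "content": "He's called Biscuit."},
]
for turn, ctx in extraction_batches(turns):
    # the context passed alongside each turn is strictly in the past
    assert all(t is not turn for t in ctx)
```

The trailing context is what lets the extractor resolve "He's called Biscuit" to the dog mentioned two turns earlier.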
Extracted memories are written with canonical deduplication. Each memory stores an eventDate, documentDate, and confidence score. A quality gate rejects chatter, multi-fact blobs, and unresolved pronouns before indexing.
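A sketch of the memory record and quality gate described above, assuming simple heuristic checks; the field names match the text, but the gate's thresholds and pronoun list are illustrative stand-ins for whatever RetainDB actually uses.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Memory:
    text: str
    event_date: date      # when the fact happened
    document_date: date   # when the conversation stated it
    confidence: float     # extractor's confidence in the fact

PRONOUNS = {"he", "she", "it", "they", "this", "that"}

def passes_quality_gate(m: Memory) -> bool:
    """Reject chatter, multi-fact blobs, and unresolved pronouns
    before indexing (illustrative heuristics only)."""
    if m.confidence < 0.5:
        return False                   # low-confidence chatter
    if m.text.count(".") > 1:
        return False                   # likely a multi-fact blob
    first_word = m.text.split()[0].lower()
    return first_word not in PRONOUNS  # subject must be resolved

good = Memory("Alex adopted a dog named Biscuit.",
              date(2025, 1, 3), date(2025, 1, 10), 0.9)
assert passes_quality_gate(good)
```

Storing both `eventDate` and `documentDate` matters later: temporal questions reason over when things happened, not when they were mentioned.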
For each question, all memories for the isolated project are dumped in date order and passed to the answer model. No lossy semantic retrieval — the full timeline is visible, which is critical for temporal-reasoning questions.
gpt-5.4 answers the question given today's date and the memory timeline. The model is instructed to always attempt an answer if relevant dates or facts exist, and compute durations rather than returning "I don't know".
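The retrieval and answering steps above can be sketched as a single prompt builder. This is an assumption about the prompt's shape, not RetainDB's exact template: dump the full timeline in date order, anchor it with today's date, and instruct the model to compute rather than abstain.

```python
from datetime import date

def build_answer_prompt(memories, question, today):
    """Dump every memory for the project in event-date order (no
    lossy semantic retrieval) and frame the question with today's
    date so the model can compute durations instead of refusing."""
    timeline = sorted(memories, key=lambda m: m["event_date"])
    lines = [f"{m['event_date'].isoformat()}: {m['text']}" for m in timeline]
    return (
        f"Today is {today.isoformat()}.\n"
        "Memory timeline:\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}\n"
        "If relevant dates or facts exist, attempt an answer and "
        "compute durations rather than replying 'I don't know'."
    )
```

Because the whole timeline is visible, a question like "how long after X did Y happen?" reduces to date arithmetic the model can actually perform.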
Every run is reproducible. The benchmark runner ships in our repo.
16 questions about bleeding-edge SDK APIs, tested against GPT-5 with and without RetainDB. Without grounded context, the model hallucinated on nearly every question.
One question from the benchmark, shown in full.
Enable Claude extended thinking mode via the Python SDK.
No usable implementation provided. The model described a thinking parameter that does not exist in the current SDK, producing code that throws an AttributeError at runtime.
The working implementation, for contrast:

```python
from anthropic import Anthropic

client = Anthropic()

message = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    # max_tokens must be greater than the thinking budget
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{"role": "user", "content": "Solve this step-by-step."}],
)
```

Six of the 16 questions, shown below. RetainDB answered correctly on all 16.
| Question | RetainDB | Baseline | Failure mode |
|---|---|---|---|
| Enable Claude extended thinking mode via the Python SDK. | Correct | Hallucinated | Model fabricated a parameter that does not exist in the SDK. |
| Use the Anthropic SDK to stream a message with tool use. | Correct | Hallucinated | Model used an outdated streaming API signature. |
| Configure a custom base URL in the OpenAI Node SDK v4. | Correct | Hallucinated | Model described v3 configuration options. |
| Add a timeout to a LangChain chat model. | Correct | Hallucinated | Model invented a timeout parameter that was never in the API. |
| Use the Vercel AI SDK to stream text with a system prompt. | Correct | Hallucinated | Model used a removed streamText signature. |
| Set per-request headers in the Anthropic SDK. | Correct | Hallucinated | Model described a non-existent headers option. |
Enough detail to inspect the claim without reading a paper.
Connect your docs or repo and run a query. See what grounded context does to your answers.
These links tie the benchmark to the commercial and educational pages that matter for SEO and buyer intent.
The core landing page this benchmark should support.
A direct comparison page for one of the highest-intent searches.
A comparison page that uses the benchmark in a buyer-facing way.
A direct comparison against another benchmark-heavy competitor in the same lane.
The most user-visible memory behavior behind the benchmark.
How benchmarked memory fits into a larger context stack.