Memory benchmark · LongMemEval

Long-horizon memory evaluation

LongMemEval is the academic standard for personal memory systems — 500 questions across six categories drawn from realistic multi-session conversations. We ran RetainDB against the same oracle split used in published research.

LongMemEval — ICLR 2025
Category results
Oracle split · gpt-5.4 extractor & answer · gpt-5.4-mini judge
Legend: better or equal · within 8 pp
Category | RetainDB | Supermemory | Zep
Single-session user (recall facts the user stated about themselves) | 88% | 97% | 93%
Single-session preference (surface preferences for personalised response generation) | 88% | 70% | 57%
Temporal reasoning (reason over dates, durations, and event ordering) | 74% | 77% | 62%
Knowledge update (retrieve the most-recent value when facts changed over time) | 76% | 89% | 83%
Multi-session (synthesise facts scattered across many conversations) | 68% | 71% | 58%
Overall | published / in progress | 79% | 81.6%

Supermemory and Zep figures sourced from published research pages. Remaining RetainDB categories are actively being evaluated.

How RetainDB handles it

LongMemEval tests recall across real multi-session conversations. Our pipeline is built for this: extract memories turn-by-turn, write them with provenance, retrieve chronologically.

01
Turn-by-turn extraction

Every turn (user and assistant) is processed individually by gpt-5.4-mini. A sliding 3-turn context window disambiguates pronouns and vague references without leaking future evidence.
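
A minimal sketch of that windowing logic, with illustrative names rather than RetainDB's actual API: each turn is paired with up to three preceding turns of context, and never with anything that comes after it.

```python
def extraction_windows(turns, window=3):
    """Yield (context, current_turn) pairs for turn-by-turn extraction.

    Each turn is processed with up to `window` preceding turns as context,
    so pronouns can be disambiguated without leaking future evidence.
    """
    for i, turn in enumerate(turns):
        context = turns[max(0, i - window):i]
        yield context, turn


turns = [
    {"role": "user", "content": "I adopted a dog named Miso."},
    {"role": "assistant", "content": "Congratulations!"},
    {"role": "user", "content": "She is a shiba inu."},
]
for context, turn in extraction_windows(turns):
    pass  # each (context, turn) pair would be sent to the extractor model
```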

02
Atomic memory writes

Extracted memories are written with canonical deduplication. Each memory stores an eventDate, documentDate, and confidence score. A quality gate rejects chatter, multi-fact blobs, and unresolved pronouns before indexing.

03
Chronological retrieval

For each question, all memories for the isolated project are dumped in date order and passed to the answer model. No lossy semantic retrieval — the full timeline is visible, which is critical for temporal-reasoning questions.
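
The retrieval step above is deliberately simple; a sketch under the same assumptions (dict-shaped memories with an `event_date` field) is just a sort and a join:

```python
def build_timeline(memories):
    """Render every memory for a project in event-date order.

    Nothing is filtered out: the answer model sees the full timeline,
    which is what temporal-reasoning questions depend on.
    """
    ordered = sorted(memories, key=lambda m: m["event_date"])
    return "\n".join(f"[{m['event_date']}] {m['text']}" for m in ordered)


memories = [
    {"event_date": "2026-02-10", "text": "User moved to Lisbon."},
    {"event_date": "2025-11-03", "text": "User started a new job."},
]
timeline = build_timeline(memories)
```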

04
Answer generation

gpt-5.4 answers the question given today's date and the memory timeline. The model is instructed to always attempt an answer when relevant dates or facts exist, and to compute durations rather than returning "I don't know".
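
A hypothetical version of that prompt assembly (the exact wording RetainDB uses is not published; this only mirrors the behaviour described above):

```python
def build_answer_prompt(question, timeline, today="2026-03-12"):
    """Assemble the answer-model prompt: today's date, the full memory
    timeline, and an instruction to compute durations instead of refusing."""
    return (
        f"Today's date is {today}.\n\n"
        f"Memory timeline:\n{timeline}\n\n"
        f"Question: {question}\n\n"
        "If relevant dates or facts exist, always attempt an answer; "
        "compute durations from the dates rather than answering "
        '"I don\'t know".'
    )
```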

Methodology

Every run is reproducible. The benchmark runner ships in our repo.

Dataset: LongMemEval (ICLR 2025) — oracle split
Questions: up to 50 per category, 6 categories
Extraction model: gpt-5.4 (EXTRACTOR_MODEL)
Answer model: gpt-5.4, temperature 0, full memory dump
Judge model: gpt-5.4-mini (official evaluator)
Date: March 2026
Isolation: separate DB project per question — no cross-contamination
Ingestion: turn-by-turn extraction, 3-turn sliding context window
Second benchmark
Retrieval benchmark · Hallucination

Code hallucination benchmark

16 questions about bleeding-edge SDK APIs tested against GPT-5 with and without RetainDB. Without grounded context, the model hallucinated on nearly every question.

0% — RetainDB hallucination rate (grounded retrieval mode)
95.5% — GPT-5 hallucination rate (no grounded context)
94.8% — retrieval recall (across 16 questions)
13 ms — avg retrieval latency (memory benchmark)
Hallucination rate comparison
Lower is better. Same set of 16 bleeding-edge API questions, March 2026.

RetainDB (grounded retrieval): 0%
GPT-5 Web (web search baseline): 89.6%
GPT-5 (no grounded context): 95.5%
Real example

One question from the benchmark, shown in full.

Question

Enable Claude extended thinking mode via the Python SDK.

GPT-5 — no grounded context

No usable implementation provided. The model described a thinking parameter that does not exist in the current SDK, producing code that throws an AttributeError at runtime.

RetainDB — grounded retrieval
from anthropic import Anthropic

client = Anthropic()

message = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    # max_tokens must be greater than the thinking budget
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{"role": "user", "content": "Solve this step-by-step."}],
)
Sample questions

6 of 16 questions. RetainDB answered correctly on all 16.

Question | RetainDB | Baseline | Failure mode
Enable Claude extended thinking mode via the Python SDK. | Correct | Hallucinated | Model fabricated a parameter that does not exist in the SDK.
Use the Anthropic SDK to stream a message with tool use. | Correct | Hallucinated | Model used an outdated streaming API signature.
Configure a custom base URL in the OpenAI Node SDK v4. | Correct | Hallucinated | Model described v3 configuration options.
Add a timeout to a LangChain chat model. | Correct | Hallucinated | Model invented a timeout parameter that was never in the API.
Use the Vercel AI SDK to stream text with a system prompt. | Correct | Hallucinated | Model used a removed streamText signature.
Set per-request headers in the Anthropic SDK. | Correct | Hallucinated | Model described a non-existent headers option.
Methodology

Enough detail to inspect the claim without reading a paper.

Benchmark: 16-question code hallucination matrix — bleeding-edge SDK APIs
Date: March 12, 2026
Models tested: GPT-5 and Claude Sonnet 4.5, temperature 0.0
Evaluation method: custom hallucination classifier, documentation-only grounding mode
Retrieval mode: real_retrieval — live indexed documentation, no caching
Memory benchmark: 10/10 successful retrievals — score 100, 13 ms avg latency (March 5, 2026)
Source benchmark: 12/12 successful retrievals across 39 real files — score 100, 31.8 ms avg latency
Key finding: 0% RetainDB hallucination rate vs 95.5% GPT-5 baseline on the same question set

Try it on your own codebase

Connect your docs or repo and run a query. See what grounded context does to your answers.

Start free · Talk to us