Memory benchmark · LongMemEval

Long-horizon memory evaluation

LongMemEval is the academic standard for personal memory systems — 500 questions across six categories drawn from realistic multi-session conversations. We ran RetainDB against the same oracle split used in published research.

LongMemEval — ICLR 2025
Category results
Oracle split · gpt-5.4 extractor & answer · gpt-5.4-mini judge
Legend: better or equal · within 8 pp
Category | RetainDB | Supermemory | Zep
Single-session user (recall facts the user stated about themselves) | 88% | 97% | 93%
Single-session preference (surface preferences for personalised response generation) | 88% | 70% | 57%
Temporal reasoning (reason over dates, durations, and event ordering) | 74% | 77% | 62%
Knowledge update (retrieve the most-recent value when facts changed over time) | 76% | 89% | 83%
Multi-session (synthesise facts scattered across many conversations) | 68% | 71% | 58%
Overall | published / in progress | 79% | 81.6%

Supermemory and Zep figures sourced from published research pages. Remaining RetainDB categories are actively being evaluated.

How RetainDB handles it

LongMemEval tests recall across real multi-session conversations. Our pipeline is built for this: extract memories turn-by-turn, write them with provenance, retrieve chronologically.

01
Turn-by-turn extraction

Every turn (user and assistant) is processed individually by gpt-5.4-mini. A sliding 3-turn context window disambiguates pronouns and vague references without leaking future evidence.
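
A minimal sketch of that windowing logic, with illustrative names rather than RetainDB's actual API: each turn is paired with up to three preceding turns of context, and never with anything that comes after it.

```python
def extraction_windows(turns, window=3):
    """Yield (context, current_turn) pairs for turn-by-turn extraction.

    Each turn is processed with up to `window` preceding turns as context,
    so pronouns can be disambiguated without leaking future evidence.
    """
    for i, turn in enumerate(turns):
        context = turns[max(0, i - window):i]
        yield context, turn


turns = [
    {"role": "user", "content": "I adopted a dog named Miso."},
    {"role": "assistant", "content": "Congratulations!"},
    {"role": "user", "content": "She is a shiba inu."},
]
for context, turn in extraction_windows(turns):
    pass  # each (context, turn) pair would be sent to the extractor model
```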

02
Atomic memory writes

Extracted memories are written with canonical deduplication. Each memory stores an eventDate, documentDate, and confidence score. A quality gate rejects chatter, multi-fact blobs, and unresolved pronouns before indexing.

03
Chronological retrieval

For each question, all memories for the isolated project are dumped in date order and passed to the answer model. No lossy semantic retrieval — the full timeline is visible, which is critical for temporal-reasoning questions.
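
The retrieval step above is deliberately simple; a sketch under the same assumptions (dict-shaped memories with an `event_date` field) is just a sort and a join:

```python
def build_timeline(memories):
    """Render every memory for a project in event-date order.

    Nothing is filtered out: the answer model sees the full timeline,
    which is what temporal-reasoning questions depend on.
    """
    ordered = sorted(memories, key=lambda m: m["event_date"])
    return "\n".join(f"[{m['event_date']}] {m['text']}" for m in ordered)


memories = [
    {"event_date": "2026-02-10", "text": "User moved to Lisbon."},
    {"event_date": "2025-11-03", "text": "User started a new job."},
]
timeline = build_timeline(memories)
```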

04
Answer generation

gpt-5.4 answers the question given today's date and the memory timeline. The model is instructed to always attempt an answer when relevant dates or facts exist, and to compute durations rather than returning "I don't know".
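
A hypothetical version of that prompt assembly (the exact wording RetainDB uses is not published; this only mirrors the behaviour described above):

```python
def build_answer_prompt(question, timeline, today="2026-03-12"):
    """Assemble the answer-model prompt: today's date, the full memory
    timeline, and an instruction to compute durations instead of refusing."""
    return (
        f"Today's date is {today}.\n\n"
        f"Memory timeline:\n{timeline}\n\n"
        f"Question: {question}\n\n"
        "If relevant dates or facts exist, always attempt an answer; "
        "compute durations from the dates rather than answering "
        '"I don\'t know".'
    )
```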

Methodology

Every run is reproducible. The benchmark runner ships in our repo.

Dataset: LongMemEval (ICLR 2025) — oracle split
Questions: up to 50 per category, 6 categories
Extraction model: gpt-5.4 (EXTRACTOR_MODEL)
Answer model: gpt-5.4, temperature 0, full memory dump
Judge model: gpt-5.4-mini (official evaluator)
Date: March 2026
Isolation: separate DB project per question — no cross-contamination
Ingestion: turn-by-turn extraction, 3-turn sliding context window
Second benchmark
Retrieval benchmark · Hallucination

Code hallucination benchmark

16 questions about bleeding-edge SDK APIs tested against GPT-5 with and without RetainDB. Without grounded context, the model hallucinated on nearly every question.

0% — RetainDB hallucination rate (grounded retrieval mode)
95.5% — GPT-5 hallucination rate (no grounded context)
94.8% — retrieval recall (across 16 questions)
13 ms — avg retrieval latency (memory benchmark)
Hallucination rate comparison
Lower is better. Same set of 16 bleeding-edge API questions, March 2026.

RetainDB (grounded retrieval): 0%
GPT-5 Web (web search baseline): 89.6%
GPT-5 (no grounded context): 95.5%
Real example

One question from the benchmark, shown in full.

Question

Enable Claude extended thinking mode via the Python SDK.

GPT-5 — no grounded context

No usable implementation provided. The model described a thinking parameter that does not exist in the current SDK, producing code that throws an AttributeError at runtime.

RetainDB — grounded retrieval
from anthropic import Anthropic

client = Anthropic()

message = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    # max_tokens must be greater than the thinking budget
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{"role": "user", "content": "Solve this step-by-step."}],
)
Sample questions

6 of 16 questions. RetainDB answered correctly on all 16.

Question | RetainDB | Baseline | Failure mode
Enable Claude extended thinking mode via the Python SDK. | Correct | Hallucinated | Model fabricated a parameter that does not exist in the SDK.
Use the Anthropic SDK to stream a message with tool use. | Correct | Hallucinated | Model used an outdated streaming API signature.
Configure a custom base URL in the OpenAI Node SDK v4. | Correct | Hallucinated | Model described v3 configuration options.
Add a timeout to a LangChain chat model. | Correct | Hallucinated | Model invented a timeout parameter that was never in the API.
Use the Vercel AI SDK to stream text with a system prompt. | Correct | Hallucinated | Model used a removed streamText signature.
Set per-request headers in the Anthropic SDK. | Correct | Hallucinated | Model described a non-existent headers option.
Methodology

Enough detail to inspect the claim without reading a paper.

Benchmark: 16-question code hallucination matrix — bleeding-edge SDK APIs
Date: March 12, 2026
Models tested: GPT-5 and Claude Sonnet 4.5, temperature 0.0
Evaluation method: custom hallucination classifier, documentation-only grounding mode
Retrieval mode: real_retrieval — live indexed documentation, no caching
Memory benchmark: 10/10 successful retrievals — score 100, 13 ms avg latency (March 5, 2026)
Source benchmark: 12/12 successful retrievals across 39 real files — score 100, 31.8 ms avg latency
Key finding: 0% RetainDB hallucination rate vs 95.5% GPT-5 baseline on the same question set

Try it on your own codebase

Connect your docs or repo and run a query. See what grounded context does to your answers.

Start free · Talk to us