Recall Bench

A benchmark harness for evaluating agent memory systems on long-horizon recall (up to 1,000 days of synthetic daily logs per persona).

What’s in this section

Recall Bench overview — The scoring dimensions, the ten recall categories, the persona corpus, the three-phase evaluation loop, and what each category measures.
Running with a coding agent — How to drive the full pipeline (create persona → generate corpus → run → analyze) with a coding agent like Claude, Codex, or Copilot.
vs. MemoryBench — How Recall Bench (long-horizon recall fidelity) compares to MemoryBench (continual learning from user feedback).
Published results:
- Published Runs — Postmortem — Cross-system retrospective of all nine published runs, with failure triage and code-level findings.
- OpenClaw EA benchmark — 180d + 500d — Combined report covering both Executive-Assistant runs, with issue analysis pulled from the failure logs.

What gets measured

Each Q&A pair is scored on three judged dimensions and tagged with one recall category:

correctness (0–3) + completeness (0–2) + hallucination (0–1) = composite score (0–6)
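As a rough illustration of the arithmetic above, here is a minimal Python sketch of a per-question score record. The `JudgedScore` class and its field names are hypothetical and not taken from the harness. The sketch also assumes that a higher hallucination score means better grounding, which is how the surrounding text reads.

```python
from dataclasses import dataclass

# Upper bounds for each judged dimension, taken from the formula above.
# The class and field names are illustrative, not the harness's own types.
LIMITS = {"correctness": 3, "completeness": 2, "hallucination": 1}


@dataclass(frozen=True)
class JudgedScore:
    correctness: int    # 0-3
    completeness: int   # 0-2
    hallucination: int  # 0-1 (assumed: higher means better grounded)

    def __post_init__(self) -> None:
        # Reject any dimension outside its documented range.
        for name, upper in LIMITS.items():
            value = getattr(self, name)
            if not 0 <= value <= upper:
                raise ValueError(f"{name} must be in 0..{upper}, got {value}")

    @property
    def composite(self) -> int:
        # The composite is the plain sum of the three dimensions (0-6).
        return self.correctness + self.completeness + self.hallucination


if __name__ == "__main__":
    print(JudgedScore(correctness=3, completeness=1, hallucination=1).composite)
```

Keeping the three fields on the record, rather than storing only the sum, lets a report show the hallucination dimension on its own next to the composite.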

The hallucination dimension is scored separately so that two failure modes stay distinguishable: a system can be confidently wrong (high recall, low hallucination grounding) or accurately silent (low recall, high hallucination grounding). Mixing them into one number hides which failure mode dominates.

Each question is tagged with one of eight core categories: factual-recall, temporal-reasoning, decision-tracking, contradiction-resolution, cross-reference, recency-bias-resistance, synthesis, negative-recall. Two group-aware categories, group-session-attribution and information-boundary, add session-attribution and cross-session leakage tests for multi-session personas; they are opt-in via --groups-enabled. The harness reports per-category scores so you can see which kind of memory work degrades first as the corpus grows.
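The per-category reporting described above can be sketched as a simple group-by over tagged results. This is an assumption-laden illustration, not the harness's implementation: the `per_category_means` function, its `(category, composite)` input shape, and the `groups_enabled` parameter (which only mirrors the --groups-enabled flag) are all hypothetical. The category names are the ones listed above.

```python
from collections import defaultdict
from statistics import mean

CORE_CATEGORIES = (
    "factual-recall", "temporal-reasoning", "decision-tracking",
    "contradiction-resolution", "cross-reference",
    "recency-bias-resistance", "synthesis", "negative-recall",
)
GROUP_CATEGORIES = ("group-session-attribution", "information-boundary")


def per_category_means(results, groups_enabled=False):
    """Average composite score per category.

    `results` is an iterable of (category, composite) pairs; this input
    shape is assumed for illustration only.
    """
    allowed = set(CORE_CATEGORIES)
    if groups_enabled:
        # Group-aware categories are only valid when explicitly opted in.
        allowed |= set(GROUP_CATEGORIES)

    buckets = defaultdict(list)
    for category, composite in results:
        if category not in allowed:
            raise ValueError(f"unexpected category: {category}")
        buckets[category].append(composite)

    return {category: mean(scores) for category, scores in buckets.items()}


if __name__ == "__main__":
    sample = [("synthesis", 6), ("synthesis", 4), ("negative-recall", 2)]
    print(per_category_means(sample))
```

Comparing these per-category averages across corpus sizes is one way to see which category degrades first as the corpus grows.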

Running the benchmark

The operator’s playbook for running and managing benchmarks lives in the repo at bench-program.md. Build instructions for harnesses that must be built inside a sibling repo, such as OpenClaw, live in each bench-harnesses/<system>/harness-program.md.
