Running Recall Bench with a coding agent
Recall Bench is built to be operated end-to-end by a coding agent. There is no GUI and no hidden state — every command is a recall-bench subcommand, every artifact is a file on disk, and the agent can drive each step, inspect the output, and recover from failures itself. This page is the quick orientation; the full operator’s playbook is bench-program.md in the repo.
Table of contents
Why a coding agent
A full benchmark is a long, failure-prone, multi-step job: author a persona, generate ~1,000 days of synthetic memories (hundreds of LLM calls), build Q&A pairs, then ingest-and-query a memory system at dozens of checkpoints. A coding agent is the right operator because it can:
- Run the pipeline — invoke each
recall-benchsubcommand and check the result before moving on. - Recover — every long step writes incrementally and is resume-safe; the agent re-runs from where it stopped.
- Grade — act as an out-of-band audit layer over the LLM judge (see The agent is the grader).
Model selectors
The same selector syntax works everywhere a model is taken (--model, --judge, --appellate-judge):
| Selector | Backed by | Env required |
|---|---|---|
claude, codex, copilot |
A local CLI coding agent subprocess | The CLI installed and signed in — no API keys |
openai or openai:<model-id> |
OpenAI API | OPENAI_API_KEY |
anthropic:<model-id> |
Anthropic API | ANTHROPIC_API_KEY |
azure:<deployment>[;<endpoint>] |
Azure OpenAI / Foundry | AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_VERSION |
<path/to/module.js> |
A JS module exporting a GeneratorModel / JudgeModel |
Whatever the module needs |
Coding-agent quirk. The local CLI agents (
claude,codex,copilot) ignore--temperatureand--max-tokens— the pipeline strips those flags before invoking them. Configure generation parameters through the agent’s own config instead. Those flags are honored only for API model specs and JS modules.
The pipeline
persona.yaml + arcs-NNNd.yaml
│
├─ Pass 1: generate → memories-NNNd/day-XXXX.md (daily logs)
├─ Pass 2: generate-conversations → conversations-NNNd/ (user↔agent turns, optional)
├─ Pass 3: generate-tool-calls → tools-NNNd/ (memorySave calls, optional)
└─ Pass 4: generate-qa → qa-NNNd/ (graded Q&A pairs)
│
bench profile + adapter ──→ recall-bench run ──→ bench-results/<run>/
result.json · progress.jsonl
failures.jsonl · heatmap.png
Each pass is independent — regenerate any one without redoing the others.
1. Author the persona
recall-bench create-persona \
--prompt "An emergency physician working night shifts at a level-1 trauma center..." \
--persona ./packages/recall-bench/personas/er-physician \
--model azure:gpt-5.4
Produces a draft persona.yaml + arcs-1000d.yaml. Both usually need a hand-edit pass (sessions, cast, arc primarySession assignments) before generation.
2. Generate the corpus (Pass 1)
recall-bench generate \
--persona ./packages/recall-bench/personas/er-physician \
--model claude \
--start 1 --end 180
The longest step — minutes per day on dense personas. Files are written as each day is produced, so a crash at day 437 leaves 436 days on disk; resume with --start 437. Run it as a background job and tail progress.
3. Conversations / tool-calls (Passes 2–3, optional)
generate-conversations reconstructs the dialog turns that would have produced each day; generate-tool-calls emits the memorySave(...) sequence an agent would have made. Use these for systems that ingest transcripts or a tool loop rather than raw markdown.
4. Author Q&A (Pass 4)
# Standard recall questions
recall-bench generate-qa --persona <dir> --model azure:gpt-5.4 --arcs arcs-180d.yaml
# Information-boundary probes for multi-session personas
recall-bench generate-qa --persona <dir> --model azure:gpt-5.4 --arcs arcs-180d.yaml --mode boundary
Generated incrementally in checkpoint windows; resume-safe.
5. Run the benchmark
recall-bench run \
--profile packages/recall-bench/profiles/ea-180d-recall-baseline.yaml \
--json-out bench-results/drafts/<run-id>/result.json
A profile bundles persona, adapter, judge models, ranges, sampling, and timeouts — prefer editing a profile over piling on CLI flags. Pointing --json-out at a per-run directory makes the bench write the canonical siblings (heatmap.png, progress.jsonl, failures.jsonl) alongside it.
Always set
--json-out. Without it you can’t resume — a killed run starts over. Resume with--resume bench-results/drafts/<run-id>/progress.jsonl.
The agent is the ultimate grader
The LLM judge (models.judge / models.appellateJudge) makes calibration errors in both directions. Left alone, the bench measures the judge as much as the system. The coding agent is the out-of-band audit layer: after each checkpoint lands in progress.jsonl, read that checkpoint’s questions from questions.jsonl and re-grade — classifying each disagreement as a judge false negative, a judge false positive, a Q&A defect (the reference is wrong), or a real system failure. Only the last should drive iteration.
Every run gets a notes.md in its results directory that the agent owns and the user reads — a per-checkpoint log plus an end-of-run analysis that explains why the scores look the way they do, with source file:line citations for failure clusters. The bar: “scores were bad because the cross-encoder over-promotes token-similar dailies on date-pinned queries, see search.ts:495” — not “scores were bad.”
See the published-runs postmortem for a worked example of this grading methodology across nine runs.
Long jobs in the background
- Set
--json-outso artifacts exist for resume. - Stream
progress.jsonl— tail it for per-checkpoint records, or count lines to gauge progress. - Capture stderr (
2>&1 | tee run.log) — generation passes log per-day failures there. - Plan for resume —
--resume <progress.jsonl>rather than restarting on any interruption. - Watch for failure, not just completion — silent watching never tells you a process crashed.
Where the canonical references live
| Topic | File |
|---|---|
| Full operator’s playbook | bench-program.md |
| Benchmark design spec | specs/recall-bench.md |
| Per-harness build/run steps | bench-harnesses/<system>/harness-program.md (when present) |
| Existing profiles | packages/recall-bench/profiles/ |
| Scoring & categories | Recall Bench overview |