A reproducible benchmark for structural code memory. Four setups, ten tasks, one number per cell.
Every tool in the "AI coding memory" space makes a savings claim. None of them publish a benchmark you can run yourself. EngramBench is engram's down-payment on that: a harness, a task set, a scoring rule, and a reference report.
You should not trust engram's "82% token reduction" claim just because engram says so. You should run this benchmark against your own codebase — or against the reference project — and see the number for yourself. If it holds, cite it. If it doesn't, file an issue.
For each benchmark task, we measure total prompt tokens consumed to reach a correct answer, under four setups:
| Setup | Description |
|---|---|
| baseline | Bare Claude Code, no memory tool. The agent uses Read/Grep/Glob directly. |
| cursor-memory | Simulates Cursor's prose memory approach. (v0.2 will replace this with a live Cursor run.) |
| anthropic-memorymd | Uses Anthropic's native MEMORY.md (prose block). |
| engram | engram v0.3.1+ with PreToolUse hooks enabled. |
Lower is better. The primary metric is relative reduction vs. baseline. Secondary metrics: Read hit rate, false-injection rate, time-to-answer.
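To pin down the primary metric, here is a minimal TypeScript sketch of the scoring arithmetic. The types and names are hypothetical, not part of the actual harness.

```ts
// Hypothetical result shape; the real harness may record more fields.
interface RunResult {
  task: string; // e.g. "task-01-find-caller"
  setup: "baseline" | "cursor-memory" | "anthropic-memorymd" | "engram";
  promptTokens: number; // total prompt tokens to reach a correct answer
  correct: boolean;     // per the task's scoring rubric
}

// Relative reduction vs. baseline: 0.82 means "82% fewer tokens".
// An incorrect answer earns no credit, however cheap it was.
function relativeReduction(baseline: RunResult, candidate: RunResult): number {
  if (!candidate.correct) return 0;
  return (baseline.promptTokens - candidate.promptTokens) / baseline.promptTokens;
}
```

Plugging in task-01's expected tokens from the YAML below, (4500 − 800) / 4500 ≈ 0.82, which is where a number like "82% token reduction" would come from.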
v0.1 ships 10 structural tasks: the kind of question an agent actually asks before editing code. Each task has a canonical correct answer and a scoring rubric. See `tasks/` for the full definitions.
- `task-01-find-caller` — "What calls `validateToken`?" Graph traversal.
- `task-02-parent-class` — "What does `SessionStore` extend?" Inheritance edge lookup.
- `task-03-file-for-class` — "Which file defines `AuthService`?" Label → file resolution.
- `task-04-import-graph` — "What modules import `src/auth.ts`?" Incoming import edges.
- `task-05-exported-api` — "What does `src/cli.ts` export?" File → export nodes.
- `task-06-landmine-check` — "Have we fixed a bug in `src/query.ts` recently?" Mistake node lookup.
- `task-07-architecture-sketch` — "Summarize the architecture of this repo in ≤200 tokens." Top-connected-nodes query.
- `task-08-refactor-scope` — "If I rename `queryGraph`, what files break?" 2-hop reverse dependency (sketched below).
- `task-09-hot-files` — "What files change most often?" Git log integration.
- `task-10-cross-file-flow` — "Trace the path from `handleRead` to the graph query." Path-finding.
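As an illustration of what task-08's "2-hop reverse dependency" asks for, here is a small TypeScript sketch of the traversal over a plain import graph. This is illustration only, not engram's query API; every name here is invented.

```ts
// file -> files that import it (reverse import edges)
type ImportGraph = Map<string, string[]>;

// Collect every file reachable within `hops` reverse-dependency steps.
// For task-08, hops = 2: direct importers plus their importers.
function reverseDeps(graph: ImportGraph, file: string, hops: number): Set<string> {
  const seen = new Set<string>();
  let frontier = [file];
  for (let i = 0; i < hops; i++) {
    const next: string[] = [];
    for (const f of frontier) {
      for (const importer of graph.get(f) ?? []) {
        if (!seen.has(importer)) {
          seen.add(importer);
          next.push(importer);
        }
      }
    }
    frontier = next;
  }
  return seen; // the files that break if `file`'s exports are renamed
}
```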
Each task is defined as a YAML file under `tasks/` with:

```yaml
id: task-01-find-caller
description: ...
reference_answer: ...
scoring_rubric: ...
expected_tokens:
  baseline: 4500
  engram: 800
```
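For a sense of how a runner could consume these files, here is a hedged sketch that loads a task spec and checks a measured run against its `expected_tokens`, using the 10% reproducibility tolerance stated below. The `js-yaml` dependency and all names are assumptions, not the actual harness.

```ts
// Sketch only, assuming the YAML schema above; run.sh may work differently.
import * as fs from "node:fs";
import * as yaml from "js-yaml";

interface TaskSpec {
  id: string;
  expected_tokens: Record<string, number>; // per-setup published numbers
}

// A measured run is a reproducibility bug if it misses the
// published number by more than 10%.
function withinTolerance(spec: TaskSpec, setup: string, measured: number): boolean {
  const expected = spec.expected_tokens[setup];
  return Math.abs(measured - expected) / expected <= 0.1;
}

const spec = yaml.load(
  fs.readFileSync("tasks/task-01-find-caller.yaml", "utf8"),
) as TaskSpec;
console.log(withinTolerance(spec, "engram", 830)); // true: within 10% of 800
```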
To run the benchmark:

```sh
# Reference project: engram itself (self-host)
cd bench
./run.sh --setup engram --task all

# Custom project
./run.sh --project ~/my-repo --setup engram --task task-01-find-caller
```

STATUS: v0.1 is scaffolding only. The runner (`run.sh`) is a stub, and the reference answers come from manual Claude Code runs I've done on engram's own codebase. v0.2 will automate the runner, replace the simulated cursor-memory setup with a live Cursor run, and ship the first public leaderboard.
Every number in this benchmark must be reproducible by a stranger on a different machine. If you can't run it and get within 10% of the published number, it's a bug — file it.
Apache 2.0, same as engram.