# EngramBench v0.1

A reproducible benchmark for structural code memory. Four setups, ten tasks, one number per cell.

## Why this exists

Every tool in the "AI coding memory" space makes a savings claim. None of them publish a benchmark you can run yourself. EngramBench is engram's down-payment on that: a harness, a task set, a scoring rule, and a reference report.

You should not trust engram's "82% token reduction" claim just because engram says so. You should run this benchmark against your own codebase — or against the reference project — and see the number for yourself. If it holds, cite it. If it doesn't, file an issue.

## What it measures

For each benchmark task, we measure total prompt tokens consumed to reach a correct answer, under four setups:

| Setup | Description |
| --- | --- |
| `baseline` | Bare Claude Code, no memory tool. The agent uses Read/Grep/Glob directly. |
| `cursor-memory` | Simulates Cursor's prose memory approach. (v0.2 will replace this with a live Cursor run.) |
| `anthropic-memorymd` | Uses Anthropic's native MEMORY.md (prose block). |
| `engram` | engram v0.3.1+ with PreToolUse hooks enabled. |

Lower is better. The primary metric is relative reduction vs. baseline, i.e. `1 - (setup tokens / baseline tokens)`. Secondary metrics: Read hit rate, false-injection rate, and time-to-answer.
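
As a sketch of how the primary metric falls out of the raw counts (the function and example values are illustrative, not the harness's actual interface):

```python
# Illustrative only: how the primary metric is derived from per-setup
# prompt-token totals. The real harness is run.sh; nothing here is its API.

def relative_reduction(baseline_tokens: int, setup_tokens: int) -> float:
    """Fraction of baseline prompt tokens saved by a setup (higher is better)."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - setup_tokens / baseline_tokens

# Using the expected_tokens from task-01 below: 1 - 800/4500 is roughly 0.82,
# i.e. an 82% reduction.
print(f"{relative_reduction(4500, 800):.0%}")
```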

## The tasks

v0.1 ships 10 structural tasks — the kind of question an agent actually asks before editing code. Each task has a canonical correct answer and a scoring rubric. See tasks/ for the full definitions.

  1. task-01-find-caller — "What calls validateToken?" Graph traversal.
  2. task-02-parent-class — "What does SessionStore extend?" Inheritance edge lookup.
  3. task-03-file-for-class — "Which file defines AuthService?" Label → file resolution.
  4. task-04-import-graph — "What modules import src/auth.ts?" Incoming import edges.
  5. task-05-exported-api — "What does src/cli.ts export?" File → export nodes.
  6. task-06-landmine-check — "Have we fixed a bug in src/query.ts recently?" Mistake node lookup.
  7. task-07-architecture-sketch — "Summarize the architecture of this repo in ≤200 tokens." Top-connected-nodes query.
  8. task-08-refactor-scope — "If I rename queryGraph, what files break?" 2-hop reverse dependency.
  9. task-09-hot-files — "What files change most often?" Git log integration.
  10. task-10-cross-file-flow — "Trace the path from handleRead to the graph query." Path-finding.

Each task is defined as a YAML file under `tasks/` with:

```yaml
id: task-01-find-caller
description: ...
reference_answer: ...
scoring_rubric: ...
expected_tokens:
  baseline: 4500
  engram: 800
```
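
A minimal sketch of consuming such a task file, assuming PyYAML and an illustrative file name; the shipped runner is a shell script, so treat this as a reading aid rather than the harness's code:

```python
import yaml  # PyYAML

def load_task(path: str) -> dict:
    """Parse one task definition from tasks/."""
    with open(path) as f:
        return yaml.safe_load(f)

def expected_reduction(task: dict, setup: str) -> float:
    """Expected token reduction for a setup, relative to baseline."""
    expected = task["expected_tokens"]
    return 1.0 - expected[setup] / expected["baseline"]

# File name is hypothetical; see tasks/ for the real layout.
task = load_task("tasks/task-01-find-caller.yaml")
print(f'{task["id"]}: expected engram reduction {expected_reduction(task, "engram"):.0%}')
```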

## Running the benchmark

```bash
# Reference project: engram itself (self-host)
cd bench
./run.sh --setup engram --task all

# Custom project
./run.sh --project ~/my-repo --setup engram --task task-01-find-caller
```

STATUS: v0.1 is scaffolding only. The runner (`run.sh`) is a stub, and the reference answers come from manual Claude Code runs I've done on engram's own codebase. v0.2 will automate the runner, replace the simulated `cursor-memory` setup with a live Cursor run, and ship the first public leaderboard.

## The ground rule

Every number in this benchmark must be reproducible by a stranger on a different machine. If you can't run it and get within 10% of the published number, it's a bug — file it.
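
The 10% rule is just a tolerance check; here is a minimal sketch, assuming "within 10%" means relative to the published number:

```python
def reproducible(published: float, reproduced: float, tol: float = 0.10) -> bool:
    """True if a reproduced number lands within 10% of the published one."""
    return abs(reproduced - published) <= tol * published

assert reproducible(4500, 4200)        # ~6.7% off: within tolerance
assert not reproducible(4500, 3900)    # ~13.3% off: file an issue
```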

## License

Apache 2.0, same as engram.