Skip to content

Benchmarking

Benchmarking focuses on expected-file recall for real tasks. Public release evidence lives in benchmarks/README.md. The current tuning decision log lives in benchmark-learnings.md.

Quality Bar

AgentPack is best treated as a ranked starting map. It can reduce repeated orientation work, but the agent and reviewer still own correctness.

Signal What good looks like
Token reduction Measure against raw repo text for your repo; savings depend on task, ignores, and budget
Pack size Usually 8k-25k tokens for a specific task
Pack time Seconds on a warm cache; first summarize pass is slower
Recall Expected files appear near the top; validate with agentpack benchmark --misses
Precision Good enough to reduce exploration; summaries and repo maps may still include noise
Freshness Task or repo-stale MCP reads auto-refresh; static packs are clearly marked by task, git, and snapshot checks

Use real repo evals instead of trusting compression numbers:

agentpack benchmark --init
# add historical tasks and files actually changed
agentpack benchmark --compare --misses --public-table
agentpack benchmark --release-gate
agentpack benchmark --public-suite --reproduce v0.3.20
agentpack benchmark --results-template
agentpack benchmark capture --since HEAD~1 --task "describe the completed task"
agentpack benchmark capture --since main --task "describe the completed task" --anonymous-report

Skill Routing And Keyword Quality

Skill routing has its own benchmark fields. Use them when changing skill discovery, trigger generation, BM25/domain scoring, or future semantic search fusion.

[[cases]]
task = "review this pull request for SQL injection, XSS, and code quality"
expected_skills = ["code-reviewer"]
avoid_skills = ["generic-writing"]

[[cases]]
task = "translate my retail operations experience into a software resume"
expected_skills = ["Career Changer Translator"]
avoid_skills = ["generic-writing"]

Run:

agentpack benchmark --misses

The summary table and .agentpack/benchmark_results.jsonl report:

Metric Meaning
skill_recall_at_3 fraction of expected skills found in the top three
skill_precision_at_3 fraction of top-three skills that are expected
skill_mrr reciprocal rank of the first expected skill
skill_noise_rate fraction of top-three skills matching avoid_skills
selected_skills actual top skill recommendations

For keyword quality, write cases around the user wording that previously failed. The goal is not to preserve a static trigger list; it is to prove that real task phrases select the right skill and avoid broad generic recommendations.

agentpack benchmark --release-gate is the public release gate. It expands to --public-repos --prove-targets --misses --public-table, reads benchmarks/public-repos.toml by default, and can use --public-repos-cache or --refresh-public-repos.

For external claims, use several real repositories or anonymized historical task sets and publish the generated table from benchmarks/results/*-public.md. This repo includes a v0.3.20 public manifest in benchmarks/public-repos.toml; it has 8 pinned Pallets smoke commits plus 100+ sampled historical commits across Python, TypeScript, Go, Java, and monorepo projects. For sampled repos, sample_history = N takes recent first-parent, non-merge commits, derives expected_files from each commit diff, and filters them with include_globs, exclude_globs, and max_changed_files. Synthetic fixtures are useful regression tests, but should not be presented as market proof.

The current v0.3.23 release baseline is published as benchmarks/results/2026-06-14-public.md: 108 scored public cases at 66.0% recall and 51.1% token precision. The precision margin is thin, so use slice regressions before changing selector rules.

The v0.3.20 reproduction command is:

agentpack benchmark --public-suite --reproduce v0.3.20

agentpack benchmark --sample-fixtures is intentionally labeled as regression smoke. It proves the benchmark harness still catches known scenarios in this source checkout; it is not release evidence for ranking quality across public repositories.

Use agentpack benchmark capture --since <ref> --task "..." after a real task to append a reusable case to .agentpack/benchmark.toml. It infers expected_files from git diff --name-only <ref> HEAD. Use agentpack benchmark --from-history N --write-cases only as scaffolding; history-derived cases need manually filled expected_files before they prove recall.

Use --anonymous-report when sharing private-repo evidence. It writes aggregate report files under .agentpack/ without source code or private file paths.

AgentPack vs No-AgentPack A/B

File-selection benchmarks answer "did the pack include the right files?" E2E A/B runs answer "did the agent finish better with AgentPack than without it?"

agentpack benchmark e2e-init
agentpack benchmark e2e --cases .agentpack/e2e_cases.toml \
  --agent-command 'bash -lc "codex exec --cd {repo} \"$(cat {prompt})\""' \
  --strategies no-context,agentpack --trials 3 \
  --input-cost-per-mtok 1.25 --output-cost-per-mtok 10
agentpack benchmark e2e-report --baseline no-context --treatment agentpack --markdown

e2e-report compares task success, expected-file touch rate, tool calls, total tokens, estimated token cost, time-to-first-correct-file, and duration.

Download Stats

npm exposes official package download counts through its public registry API and the npm downloads badge above:

curl https://api.npmjs.org/downloads/point/last-month/%40vishal2612200%2Fagentpack
curl https://api.npmjs.org/downloads/point/last-week/%40vishal2612200%2Fagentpack

PyPI does not show official project download counts on package pages. For rough trend data on the Python core package, use third-party mirrors:

curl https://pypistats.org/api/packages/agentpack-cli/recent