Benchmark Learnings
This page records the engineering lessons from the public-suite precision push. It is a decision log, not a marketing table.
Current Verified State
The latest verified release-target gate was run against the expanded public suite:
PYTHONPATH=src python -m agentpack.cli benchmark \
--public-suite \
--no-public-table \
--benchmark-jsonl /tmp/agentpack-full-maintenance-recovery.jsonl
Result:
| Metric | Result |
|---|---|
| Scored cases | 108 |
| Avg precision | 43.6% |
| Avg recall | 66.0% |
| Avg F1 | 48.2% |
| Avg token precision | 51.1% |
| Recall target | Passed, 65.0% |
| Token precision target | Passed, 51.0% |
This is the first local checkpoint in this cycle that clears the 65% recall bar
while keeping token precision above 51%. The precision margin is real but thin,
so release notes should report the exact benchmark command and result artifact
rather than only the rounded headline. The release result note is
benchmarks/results/2026-06-14-public.md.
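Because the margin is thin, it can help to recompute the gate directly from the result artifact instead of trusting the rounded headline. The following is a minimal sketch, assuming each JSONL line is an object with `expected_files`, `recall`, and `token_precision` fields stored as fractions. Those field names and the `summarize_gate` helper are assumptions for illustration, not the confirmed AgentPack schema.

```python
import json
from statistics import mean

RECALL_TARGET = 0.65           # recall bar quoted on this page
TOKEN_PRECISION_TARGET = 0.51  # token precision bar quoted on this page


def summarize_gate(jsonl_lines):
    """Average per-case metrics and report the margin against each target.

    Assumed (hypothetical) record fields: expected_files, recall,
    token_precision.
    """
    cases = [json.loads(line) for line in jsonl_lines if line.strip()]
    # Only cases that carry expected files count as scored.
    scored = [c for c in cases if c.get("expected_files")]
    if not scored:
        raise ValueError("no scored cases in the artifact")
    avg_recall = mean(c["recall"] for c in scored)
    avg_token_precision = mean(c["token_precision"] for c in scored)
    return {
        "scored_cases": len(scored),
        "avg_recall": avg_recall,
        "avg_token_precision": avg_token_precision,
        "recall_margin": avg_recall - RECALL_TARGET,
        "token_precision_margin": avg_token_precision - TOKEN_PRECISION_TARGET,
        "passed": avg_recall >= RECALL_TARGET
        and avg_token_precision >= TOKEN_PRECISION_TARGET,
    }
```

Reporting the two margin fields next to the pass/fail flag keeps a narrow pass from reading like a comfortable one.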
2026-06-13 Release-Target Experiment Log
The release-target work converged only after separating evaluation noise, diagnostics, and selector behavior. The successful path was not a larger budget or broader ranking boost; it was a narrow maintenance-context recovery rule validated by intent-level diagnostics.
| Run | Avg recall | Avg token precision | Outcome |
|---|---|---|---|
| Intent diagnostics baseline | 64.2% | 52.5% | Below 65% recall, but precision was healthy enough to inspect misses by intent. |
| Cleanup recovery, first full run | 64.8% | 51.9% | Recall improved but stayed below target; precision stayed above the 51% floor. |
| Maintenance recovery final run | 66.0% | 51.1% | Target cleared across 108 scored public cases. |
Two sensitive regression slices were run after the full suite:
| Slice | Cases | Avg recall | Avg token precision | Readout |
|---|---|---|---|---|
| pallets-click | 25 | 71.0% | 50.3% | Passed the default gate, but precision is thin. |
| nestjs | 4 | 75.0% | 43.9% | Recall is stable; token precision remains below the slice target. |
The final useful intent readout from the 108-case run was:
| Intent | Cases | Avg recall | Avg token precision | Misses | Interpretation |
|---|---|---|---|---|---|
| cleanup_refactor | 9 | 57.4% | 67.1% | 5 | Best safe recall-recovery area; high precision allowed narrow cap/floor exceptions. |
| typing_api | 17 | 70.6% | 57.5% | 10 | Improved from deprecation maintenance recovery, but still has cap/floor misses. |
| dependency_release | 25 | 70.1% | 66.9% | 21 | Good precision, but many misses are legitimate multi-file dependency changes. |
| test_focus | 20 | 77.4% | 51.1% | 21 | Recall is strong, but precision is near the floor; avoid broad test expansion. |
| config_build | 14 | 48.8% | 35.0% | 14 | Main remaining bottleneck; low precision makes broad config recovery risky. |
| source_behavior | 10 | 70.0% | 35.1% | 9 | Many plausible source files are unlabeled or noisy; needs better scope/package intent. |
| docs_metadata | 5 | 40.0% | 9.1% | 6 | Too noisy for selector tuning until labels and task intent are audited. |
Successful Experiments
| Experiment | Result | Why it worked |
|---|---|---|
| Intent diagnostics | Added "By Intent" summaries and JSON diagnostics without changing selector behavior. | It identified cleanup_refactor as low recall but high token precision, which made it a safer target than config_build or docs. |
| Label-audit diagnostics | Separated audited noise from plausibly useful unlabeled context. | It prevented overreacting to low raw precision in cases where selected context was related but not labeled expected. |
| Maintenance summary-floor bypass | Allowed cleanup/refactor/deprecation candidates through the summary floor only with direct evidence. | It recovered files that were ranked and relevant but blocked by strict precision guards. |
| Cleanup/refactor cap overflow | Allowed at most two cheap compressed maintenance candidates under strict caps. | It recovered maintenance files without widening the global summary cap. |
| Token-neutral maintenance replacement | Let stronger maintenance candidates replace weaker selected maintenance summaries. | It improved slot quality without increasing token volume. |
| Deprecation-specific content gate | Allowed deprecation maintenance files with at least two content hits and a 60-point score floor. | It recovered MarkupSafe deprecation cleanup while avoiding broad recently-modified file inclusion. |
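The narrow rules in this table share one shape: a strict evidence predicate plus a hard limit on how many extra files may pass. The sketch below combines the deprecation content gate (two content hits, 60-point floor) with the two-candidate overflow limit purely for illustration. The `Candidate` type, its fields, and both function names are hypothetical, and the real selector may apply these two rules independently.

```python
from dataclasses import dataclass

MIN_CONTENT_HITS = 2  # "at least two content hits"
MIN_SCORE = 60.0      # "60-point score floor"
MAX_OVERFLOW = 2      # "at most two" candidates past the strict cap


@dataclass(frozen=True)
class Candidate:
    # Hypothetical shape; the real candidate record is not shown on this page.
    path: str
    score: float
    content_hits: int
    is_deprecation_maintenance: bool


def passes_deprecation_gate(c: Candidate) -> bool:
    """Admit a file only when every piece of direct evidence is present."""
    return (
        c.is_deprecation_maintenance
        and c.content_hits >= MIN_CONTENT_HITS
        and c.score >= MIN_SCORE
    )


def select_overflow(candidates):
    """Pick the strongest gated candidates without widening the global cap."""
    eligible = [c for c in candidates if passes_deprecation_gate(c)]
    eligible.sort(key=lambda c: c.score, reverse=True)
    return eligible[:MAX_OVERFLOW]
```

A high score alone never admits a file here, which mirrors why the broad ranking boosts described on this page did not help.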
Failed Or Marginal Experiments
| Experiment | Result | Lesson |
|---|---|---|
| Drastically widening the token budget | Did not produce a reliable move past 65% recall and risked lowering token precision. | The bottleneck was not only budget size; it was which marginal files survived selection. |
| Broad ranking boosts | Produced little or no durable aggregate gain. | Many expected files were already candidates; ranking-only changes did not fix cap, floor, and replacement losses. |
| Static variable/path-style expansion | Rejected as overfitting risk. | New rules should encode portable conventions or dynamic evidence, not public-suite expected-file lists. |
| Broad cleanup/refactor trigger including generic "refactor" and "import" | Caused Vite precision regressions without recall gains. | Generic refactor/import wording is too broad; the retained trigger is limited to maintenance terms such as unused, cleanup, simplify, lint, format, polish, and deprecation. |
| Scope-sticky cleanup overflow | Dropped the Spring slice from 65.0% to 63.3% recall. | The first overflow scope is not always the expected scope; replacement is safer than locking future overflow to the first selected scope. |
| Forcing "remove unused config" toward pyproject.toml | Rejected after slice inspection. | The competing workflow file scored higher with the same evidence; forcing the expected file would be label-specific rather than generic. |
| More test cap expansion | Kept narrow only. | Test-focused recall is already high, while token precision is close to the floor; broad test expansion would spend the precision margin. |
Eval-Set Caveats
The expanded public suite is useful, but it is not a perfect product-quality oracle.
- The command prints 128 public cases in some runs, but the current scored release-target JSONL has 108 cases with expected_files. Report the scored count when citing recall and precision.
- Expected files come from real commits, but labels are incomplete. Some selected "noise" is plausibly useful context, especially same-package, same-family, dependency, and test-adjacent files.
- Added files must be filtered out when benchmarking the parent checkout, because AgentPack cannot select files that do not exist yet.
- Token precision is harsher than file recall. A selected expected file can still contribute few expected tokens if the packed summary is too broad.
- Vite and config/build tasks remain precision-sensitive. A change that improves Spring or MarkupSafe can still regress Vite because plausible source, playground, config, and test files compete for the same slots.
- Slice wins are diagnostic, not release claims. The release claim needs a full public-suite run with exact recall and token precision.
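The gap between file recall and token precision in the caveats above is easier to see with numbers. This is a minimal sketch under an assumed definition: token precision is the share of packed tokens that overlap expected content, and file recall is the share of expected files that were selected. The exact AgentPack formulas are not given on this page, so both functions and the sample paths are illustrative only.

```python
def file_recall(selected_paths, expected_paths):
    """Share of expected files that appear in the selection."""
    expected = set(expected_paths)
    if not expected:
        return 0.0
    return len(expected & set(selected_paths)) / len(expected)


def token_precision(packed):
    """Share of packed tokens that overlap expected content.

    `packed` is a list of (path, packed_tokens, expected_tokens) tuples;
    this record shape is an assumption for the example.
    """
    total = sum(tokens for _, tokens, _ in packed)
    if total == 0:
        return 0.0
    return sum(expected for _, _, expected in packed) / total


# Hypothetical pack: the expected file is selected, but its summary is broad.
packed = [
    ("src/core.py", 400, 100),      # expected file, only 100 useful tokens
    ("tests/test_core.py", 200, 0), # related but unlabeled context
]
recall = file_recall([path for path, _, _ in packed], ["src/core.py"])
precision = token_precision(packed)  # 100 / 600, roughly 0.17
```

In this made-up case the file-level view reports a perfect hit while the token-level view shows most of the pack was spent on content that is not labeled expected.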
What Improved
Precision improved because the work separated the problem into ranking, selection, and packing failures instead of tuning one global score.
The most useful improvements were:
| Change | Why it helped |
|---|---|
| Literal definition matching | Quoted API names such as parseAst should prefer the defining/exporting file over call-site noise. |
| Multi-term path ranking | Tasks with several concrete path terms should reward files whose paths contain those terms, especially config files. |
| Conditional two-config cap | Low-budget strict packs can include two strongly matched config files without opening the door to generic config noise. |
| Package-root source detection | Monorepos often keep source under packages/<name>/... without a src/ segment. Those files need direct-source priority when evidence is strong. |
| Narrow root-Go strict support | Root Go source files with conventional-scope source evidence recovered Gin recall without expanding the summary cap. |
| Same-package paired-test overflow | Balanced no-live packs may include one extra packages/<name>/... test only when it directly tests an already selected source file and has direct content evidence. This recovered a NestJS expected test without a broad cap increase. |
| Same-playground test overflow | Balanced no-live packs may include one extra playground/<name>/... test only when the same playground already has selected context plus scope and phrase/content evidence. This recovered one Vite playground test with near-neutral token precision. |
| JVM build metadata signal | Java build/dependency tasks often expect root pom.xml or Gradle metadata. Root JVM build files now get a scoped boost. |
| Reason/family diagnostics | reason_family_precision, selected family waste, failure type counts, and low-budget last-file waste made tuning evidence-backed. |
| Parent-checkout expected-file filtering | Public history samples now exclude paths that do not exist in the parent checkout, so added files are not counted as selectable recall misses. |
What Failed
These experiments were rejected because they helped one slice while hurting the full suite:
| Rejected change | Failure mode |
|---|---|
| Broad content-only concrete boost | It over-selected noisy files with generic content hits and regressed TypeScript precision. |
| Treating .go files as release metadata | It pulled version.go into unrelated Gin tasks and hurt Go precision. |
| Broad explicit-test cap increase | It improved some recall but admitted too much test noise. |
| Specific-config pack suppression | It cleaned one Tailwind config case, but regressed CSS/config tasks that still needed source files. |
| Broad build metadata boost for pyproject.toml and package files | It fixed Spring-like tasks but hurt Python dependency/update cases, especially MarkupSafe. |
| Broad source strict-support exception | It recovered a few Go source files but admitted non-expected Java source and spent precision margin. Narrowing to the measured root-Go pattern kept the gain and removed the Java regression. |
| Compact-cost selection priority for JS/TS skeletons | It fixed one NestJS ordering issue, but reduced Vite precision by promoting plausible package tests over expected playground/config files. The safer retained rule is paired-test overflow only. |
| Weak package-test overflow | A Vite worker.spec.ts test looked locally related but had only 3 content hits and no task phrase for an import glob task. Package test overflow now requires stronger task evidence to avoid this slot waste. |
The rule from these failures: keep boosts scoped to the evidence family that proved the gain. Do not generalize from one benchmark case until the repo slice and full suite confirm it.
Benchmark Validity Guardrail
Public history cases run AgentPack against the parent of a real commit. A file added by that commit does not exist in the parent checkout, so AgentPack cannot select it. Sampled expected files must therefore be filtered to paths that exist in the parent commit.
This is a methodology correction, not a ranking improvement. It removes
impossible EXPECTED_NOT_FOUND misses from sampled public cases while preserving
modified/deleted files that are selectable from the parent checkout.
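The filtering step can be sketched as a pure function over two path collections. This is an illustrative sketch, not the AgentPack implementation: `filter_selectable_expected` is a hypothetical name, and the parent path listing is assumed to come from something like `git ls-tree -r --name-only <commit>^`.

```python
def filter_selectable_expected(expected_files, parent_paths):
    """Split expected files into selectable ones and ones added by the commit.

    expected_files: paths touched by the real commit (the labels).
    parent_paths:   paths that exist in the parent checkout.
    """
    parent = set(parent_paths)
    # Modified and deleted files exist in the parent, so they stay scorable.
    selectable = [path for path in expected_files if path in parent]
    # Added files cannot be selected from the parent and are excluded.
    added = [path for path in expected_files if path not in parent]
    return selectable, added
```

Returning the excluded paths alongside the kept ones makes it easy to log how many labels were dropped per case, which keeps the correction auditable.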
Static-List Guardrail
Static lists are acceptable only when they encode portable ecosystem conventions, not benchmark-case outcomes.
Acceptable examples:
| Convention | Why it is acceptable |
|---|---|
| Test filename forms such as _test.go, .spec.ts, and tests/ | These are language and framework conventions used outside the benchmark suite. |
| Root build metadata such as pom.xml, Gradle files, and package manifests | These are standard project files for dependency/build tasks. |
| File extensions and generated/example/test directory names | These describe broad file families, not one repo's expected answer. |
Risky examples:
| Shortcut | Why it is risky |
|---|---|
| Boosting a specific filename because one public case expected it | It can inflate one slice while hurting unrelated tasks. |
| Adding repo-shaped path rules such as a known package or fixture directory | It makes benchmark numbers less trustworthy and does not generalize. |
| Treating a language extension as task metadata without stronger evidence | It admits plausible-but-not-actionable files and wastes selection slots. |
New ranking rules should prefer dynamic evidence: task literals, symbol definitions, imports/calls, same-package locality, dependency edges, changed-file history, and measured reason-family precision. If a static convention is added, it needs a non-benchmark rationale plus negative tests that prove it does not revive known noisy families.
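A portable convention and its negative checks might look like the sketch below. It uses only the test filename forms named in the table above; `is_conventional_test_path` and the two constants are hypothetical names, and a real rule would likely cover more ecosystems.

```python
from pathlib import PurePosixPath

# Conventions taken from the acceptable-examples table, not from any
# benchmark case's expected files.
TEST_SUFFIXES = ("_test.go", ".spec.ts")
TEST_DIR_NAMES = {"tests"}


def is_conventional_test_path(path: str) -> bool:
    """True when the path follows a widely used test naming convention."""
    p = PurePosixPath(path)
    if p.name.endswith(TEST_SUFFIXES):
        return True
    # Match whole directory names only, so "contests/" does not count.
    return any(part in TEST_DIR_NAMES for part in p.parts[:-1])


# Negative checks: a bare language extension must not be enough.
assert not is_conventional_test_path("version.go")
assert not is_conventional_test_path("src/contests/entry.py")
```

The two trailing assertions play the role of the negative tests described above: they fail if a later edit loosens the rule into extension-only matching.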
Slice Readout
The latest useful slice picture:
| Slice | Status |
|---|---|
| Python CLI/library | No longer the main precision bottleneck. Click and itsdangerous are healthy; MarkupSafe recovered after narrowing build metadata. |
| TypeScript/Vite | Same-playground test overflow moved the current Vite slice from 44.5% to 45.7% recall with token precision essentially neutral at 45.5% to 45.4%. Config/source ranking and ranked-low expected files remain the main blockers. |
| Go/Gin | Improved after root Go source and test-path handling. Latest targeted slice moved from 48.1% to 55.6% recall while token precision rose slightly from 46.5% to 47.3%. |
| Java/Spring | Strong gain from scoped JVM build metadata. |
| TypeScript monorepo/NestJS | Same-package paired-test overflow improved the current scored NestJS slice from 62.5% to 70.8% recall and from 40.5% to 44.0% token precision, but the slice remains below the 50% precision target. |
Benchmark Loop Guardrail
Full public-suite runs are expensive enough that experiments should not use them as the first diagnostic. Use this loop instead:
- Establish one full-suite baseline JSONL.
- Inspect case-level misses and choose the affected repo or task-type slices.
- Run only those slices with --public-repo-filter or --public-task-type-filter and write --benchmark-jsonl.
- Compare case-level recall, token precision, selected paths, and miss status.
- Run the full public suite once only after the slice result has a credible chance of improving the aggregate gate.
Example:
agentpack benchmark \
--public-repos \
--public-repo-filter gin,spring-petclinic \
--public-repos-cache /tmp/agentpack-public-cache-full \
--benchmark-jsonl /tmp/agentpack-go-java.jsonl \
--misses
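The case-level comparison step can be done with a small script over the baseline and slice JSONL files. This is a sketch under assumptions: each record is taken to have a `case_id` plus numeric `recall` and `token_precision` fields, which are hypothetical names rather than the confirmed output schema.

```python
import json


def load_cases(jsonl_text):
    """Index JSONL records by case id (`case_id` is an assumed field name)."""
    cases = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            record = json.loads(line)
            cases[record["case_id"]] = record
    return cases


def compare_runs(baseline, candidate, metrics=("recall", "token_precision")):
    """Return (case_id, deltas) for shared cases whose metrics moved."""
    changed = []
    # Only cases present in both runs are comparable; a slice run will
    # usually cover a subset of the full-suite baseline.
    for case_id in sorted(baseline.keys() & candidate.keys()):
        deltas = {
            metric: candidate[case_id][metric] - baseline[case_id][metric]
            for metric in metrics
        }
        if any(abs(delta) > 1e-9 for delta in deltas.values()):
            changed.append((case_id, deltas))
    return changed
```

Reading the per-case deltas first shows whether a slice gain comes from one case or several, before spending a full public-suite run on it.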
Current Mental Model
Do not ask only whether recall went up. Ask where the expected file was lost:
- Candidate generation: was the expected file found at all?
- Ranking: did it reach the top candidates?
- Selection: did the pack choose it under budget and family caps?
- Packing: did selected tokens include useful expected-file content or mostly noise?
Useful recall is not just "the file appeared somewhere." It means the expected file appeared early, survived selection, and contributed enough useful tokens.
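The four questions above can be turned into a per-case classifier that reports the first stage where an expected file was lost. This is a minimal sketch: the function name, its arguments, and the `min_useful_tokens` threshold are hypothetical, and the default `top_k=50` simply echoes the candidate_recall_at_50 metric.

```python
def loss_stage(expected_path, ranked_candidates, selected_paths,
               useful_tokens_by_path, top_k=50, min_useful_tokens=1):
    """Return the first pipeline stage at which the expected file was lost.

    ranked_candidates:     candidate paths, best first.
    selected_paths:        paths that made it into the pack.
    useful_tokens_by_path: packed tokens that overlap expected content.
    """
    if expected_path not in ranked_candidates:
        return "candidate_generation"  # never discovered
    if ranked_candidates.index(expected_path) >= top_k:
        return "ranking"               # discovered, but ranked too low
    if expected_path not in selected_paths:
        return "selection"             # ranked well, cut by budget or caps
    if useful_tokens_by_path.get(expected_path, 0) < min_useful_tokens:
        return "packing"               # selected, but packed as noise
    return "ok"
```

Counting these labels across a run gives a breakdown that points at the stage to fix, rather than a single recall number that hides where the loss happened.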
Metrics To Keep Watching
These metrics were the most useful during tuning:
| Metric | Use |
|---|---|
| candidate_recall_at_50 | Separates discovery failures from ranking/selection failures. |
| candidate_precision_at_3 | Shows whether noisy files dominate the top of the ranked list. |
| token_precision | Main measure for packed-token usefulness. |
| expected_token_coverage | Approximates valuable recall for selected expected files. |
| selected_family_waste_tokens | Shows whether source, test, docs, config, fixtures, or generated files leak noise. |
| reason_family_precision | Shows which ranking reasons are trustworthy. |
| failure_type_counts | Splits misses into not found, ranked low, skipped, or noise selected above expected. |
| precision_delta_if_drop_last_summary | Identifies low-budget extra-file waste. |
Next Optimization Areas
The 2026-06-13 local checkpoint clears the immediate release target:
| Metric | Checkpoint result or limit |
|---|---|
| Expanded public-suite recall | 66.0% (target: 65.0%) |
| Expanded public-suite token precision | 51.1% (target: 51.0%) |
| Major language/task-slice recall regression | <= 2 points |
| EXPECTED_NOT_FOUND sampled-public misses | 0 |
The next release should treat this as an achieved-but-thin margin, not as room for broad expansion. Token precision sits only about 0.1 points above the 51.0% target, so the next changes should start with diagnostics and targeted slice validation.
The next precision/recall work should focus on these areas:
- Config/build intent recovery. config_build is the main remaining bottleneck: 48.8% recall and 35.0% token precision in the latest full run. Do not raise config caps broadly. First diagnose config-only, source-only, and config-plus-source tasks.
- Vite config/source balance. Some config tasks need only config files, while CSS/build tasks need both source and config. Selection needs conditional inclusion, not global suppression.
- NestJS wrong-package locality. Monorepo tasks need better package/workspace intent detection so packages/core and integration/* do not steal from each other unless the task evidence says so.
- Snippet/block packing. Several cases select the expected file but include too much surrounding noise. Function, class, config-section, and diff-hunk packing should improve token precision without sacrificing selected-file recall.
- Publish decision gates. Keep changes only when full-suite recall stays at or above 65%, token precision stays at or above 51%, and no major language/task slice regresses by more than 2 recall points.
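The publish decision gate in the last item can be written down as a single check, so that keep-or-reject decisions are mechanical. This is a sketch under stated assumptions: values are percentages, slice results are passed as plain name-to-recall mappings, and `publish_decision` is a hypothetical helper rather than part of the AgentPack CLI.

```python
RECALL_FLOOR = 65.0            # full-suite recall, in percent
TOKEN_PRECISION_FLOOR = 51.0   # full-suite token precision, in percent
MAX_SLICE_RECALL_DROP = 2.0    # allowed per-slice recall regression, in points


def publish_decision(full_recall, full_token_precision,
                     baseline_slice_recall, candidate_slice_recall):
    """Return (keep, reasons); reasons lists every gate the change failed."""
    reasons = []
    if full_recall < RECALL_FLOOR:
        reasons.append(f"full-suite recall {full_recall:.1f}% is below the floor")
    if full_token_precision < TOKEN_PRECISION_FLOOR:
        reasons.append(
            f"token precision {full_token_precision:.1f}% is below the floor"
        )
    for name, before in sorted(baseline_slice_recall.items()):
        after = candidate_slice_recall.get(name)
        if after is None:
            # A slice that was not rerun cannot be shown to be safe.
            reasons.append(f"slice {name} is missing from the candidate run")
        elif before - after > MAX_SLICE_RECALL_DROP:
            reasons.append(f"slice {name} recall dropped {before - after:.1f} points")
    return (not reasons, reasons)
```

Collecting every failed gate instead of stopping at the first one makes the rejection note more useful when a change trades one slice against another.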