Benchmark Learnings

This page records the engineering lessons from the public-suite precision push. It is a decision log, not a marketing table.

Current Verified State

The latest verified release-target gate was run against the expanded public suite:

PYTHONPATH=src python -m agentpack.cli benchmark \
  --public-suite \
  --no-public-table \
  --benchmark-jsonl /tmp/agentpack-full-maintenance-recovery.jsonl

Result:

| Metric | Result |
| --- | --- |
| Scored cases | 108 |
| Avg precision | 43.6% |
| Avg recall | 66.0% |
| Avg F1 | 48.2% |
| Avg token precision | 51.1% |
| Recall target | Passed, 65.0% |
| Token precision target | Passed, 51.0% |

This is the first local checkpoint in this cycle that clears the 65% recall bar while keeping token precision above 51%. The precision margin is real but thin, so release notes should report the exact benchmark command and result artifact rather than only the rounded headline. The release result note is benchmarks/results/2026-06-14-public.md.

2026-06-13 Release-Target Experiment Log

The release-target work converged only after separating evaluation noise, diagnostics, and selector behavior. The successful path was not a larger budget or broader ranking boost; it was a narrow maintenance-context recovery rule validated by intent-level diagnostics.

| Run | Avg recall | Avg token precision | Outcome |
| --- | --- | --- | --- |
| Intent diagnostics baseline | 64.2% | 52.5% | Below 65% recall, but precision was healthy enough to inspect misses by intent. |
| Cleanup recovery, first full run | 64.8% | 51.9% | Recall improved but stayed below target; precision stayed above the 51% floor. |
| Maintenance recovery final run | 66.0% | 51.1% | Target cleared across 108 scored public cases. |

Two sensitive regression slices were run after the full suite:

| Slice | Cases | Avg recall | Avg token precision | Readout |
| --- | --- | --- | --- | --- |
| pallets-click | 25 | 71.0% | 50.3% | Passed the default gate, but precision is thin. |
| nestjs | 4 | 75.0% | 43.9% | Recall is stable; token precision remains below the slice target. |

The final useful intent readout from the 108-case run was:

| Intent | Cases | Avg recall | Avg token precision | Misses | Interpretation |
| --- | --- | --- | --- | --- | --- |
| cleanup_refactor | 9 | 57.4% | 67.1% | 5 | Best safe recall-recovery area; high precision allowed narrow cap/floor exceptions. |
| typing_api | 17 | 70.6% | 57.5% | 10 | Improved from deprecation maintenance recovery, but still has cap/floor misses. |
| dependency_release | 25 | 70.1% | 66.9% | 21 | Good precision, but many misses are legitimate multi-file dependency changes. |
| test_focus | 20 | 77.4% | 51.1% | 21 | Recall is strong, but precision is near the floor; avoid broad test expansion. |
| config_build | 14 | 48.8% | 35.0% | 14 | Main remaining bottleneck; low precision makes broad config recovery risky. |
| source_behavior | 10 | 70.0% | 35.1% | 9 | Many plausible source files are unlabeled or noisy; needs better scope/package intent. |
| docs_metadata | 5 | 40.0% | 9.1% | 6 | Too noisy for selector tuning until labels and task intent are audited. |
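
The intent readout above can be reproduced from case-level records with a small grouping step. The sketch below is illustrative only: the field names `intent`, `recall`, `token_precision`, and `missed_files` are assumptions about the JSONL schema, not confirmed AgentPack output keys, and `safest_recovery_targets` is a hypothetical helper that mirrors the reasoning used to pick cleanup_refactor (low recall, precision above the floor).

```python
from collections import defaultdict
from statistics import mean


def summarize_by_intent(cases):
    """Group case records by intent and average their metrics.

    Each case is a dict; the keys used here are assumed, not taken
    from the real benchmark JSONL schema.
    """
    groups = defaultdict(list)
    for case in cases:
        groups[case["intent"]].append(case)

    summary = {}
    for intent, rows in groups.items():
        summary[intent] = {
            "cases": len(rows),
            "avg_recall": mean(r["recall"] for r in rows),
            "avg_token_precision": mean(r["token_precision"] for r in rows),
            # Misses are counted per file, so they can exceed the case count.
            "misses": sum(len(r["missed_files"]) for r in rows),
        }
    return summary


def safest_recovery_targets(summary, precision_floor=0.51):
    """Return intents whose precision clears the floor, lowest recall first."""
    safe = [
        (name, stats)
        for name, stats in summary.items()
        if stats["avg_token_precision"] > precision_floor
    ]
    safe.sort(key=lambda item: item[1]["avg_recall"])
    return [name for name, _ in safe]
```

Sorting by recall only after filtering on precision keeps low-precision intents such as config_build out of the shortlist, which matches the interpretation column above.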

Successful Experiments

| Experiment | Result | Why it worked |
| --- | --- | --- |
| Intent diagnostics | Added By Intent summaries and JSON diagnostics without changing selector behavior. | It identified cleanup_refactor as low recall but high token precision, which made it a safer target than config_build or docs. |
| Label-audit diagnostics | Separated audited noise from plausibly useful unlabeled context. | It prevented overreacting to low raw precision in cases where selected context was related but not labeled expected. |
| Maintenance summary-floor bypass | Allowed cleanup/refactor/deprecation candidates through the summary floor only with direct evidence. | It recovered files that were ranked and relevant but blocked by strict precision guards. |
| Cleanup/refactor cap overflow | Allowed at most two cheap compressed maintenance candidates under strict caps. | It recovered maintenance files without widening the global summary cap. |
| Token-neutral maintenance replacement | Let stronger maintenance candidates replace weaker selected maintenance summaries. | It improved slot quality without increasing token volume. |
| Deprecation-specific content gate | Allowed deprecation maintenance files with at least two content hits and a 60-point score floor. | It recovered MarkupSafe deprecation cleanup while avoiding broad recently-modified file inclusion. |
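
As an illustration of the deprecation-specific content gate, the check reduces to two thresholds applied together. This is a minimal sketch, not the selector's actual code: the `Candidate` type and its field names are hypothetical, and only the two thresholds (two content hits, 60-point score floor) come from the table above.

```python
from dataclasses import dataclass

# Thresholds taken from the experiment description above.
MIN_CONTENT_HITS = 2
MIN_SCORE = 60.0


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate record; the real selector's type may differ."""
    path: str
    score: float
    content_hits: int
    is_deprecation_maintenance: bool


def passes_deprecation_gate(candidate):
    """Admit a deprecation maintenance file only with both kinds of evidence."""
    return (
        candidate.is_deprecation_maintenance
        and candidate.content_hits >= MIN_CONTENT_HITS
        and candidate.score >= MIN_SCORE
    )
```

Requiring both conditions is what keeps a recently modified file with a high score but a single incidental content hit from slipping through.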

Failed Or Marginal Experiments

| Experiment | Result | Lesson |
| --- | --- | --- |
| Drastically widening the token budget | Did not produce a reliable move past 65% recall and risked lowering token precision. | The bottleneck was not only budget size; it was which marginal files survived selection. |
| Broad ranking boosts | Produced little or no durable aggregate gain. | Many expected files were already candidates; ranking-only changes did not fix cap, floor, and replacement losses. |
| Static variable/path-style expansion | Rejected as overfitting risk. | New rules should encode portable conventions or dynamic evidence, not public-suite expected-file lists. |
| Broad cleanup/refactor trigger including generic refactor and import | Caused Vite precision regressions without recall gains. | Generic refactor/import wording is too broad; the retained trigger is limited to maintenance terms such as unused, cleanup, simplify, lint, format, polish, and deprecation. |
| Scope-sticky cleanup overflow | Dropped the Spring slice from 65.0% to 63.3% recall. | The first overflow scope is not always the expected scope; replacement is safer than locking future overflow to the first selected scope. |
| Forcing remove unused config toward pyproject.toml | Rejected after slice inspection. | The competing workflow file scored higher with the same evidence; forcing the expected file would be label-specific rather than generic. |
| More test cap expansion | Kept narrow only. | Test-focused recall is already high, while token precision is close to the floor; broad test expansion would spend the precision margin. |
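
The narrowed maintenance trigger can be pictured as a small term match over the task text. The sketch below is an assumption-laden illustration: the stem list is derived from the retained terms named above, but prefix matching on stems (so that "deprecated" or "linting" also match) and the function name `is_maintenance_task` are choices made for this example, not the selector's documented behavior.

```python
import re

# Stems derived from the retained trigger terms: unused, cleanup, simplify,
# lint, format, polish, deprecation. Generic "refactor" and "import" are
# deliberately absent because they regressed Vite precision.
MAINTENANCE_STEMS = (
    "unused",
    "cleanup",
    "simplif",
    "lint",
    "format",
    "polish",
    "deprecat",
)


def is_maintenance_task(task_text):
    """Return True when any word in the task starts with a maintenance stem."""
    words = re.findall(r"[a-z]+", task_text.lower())
    return any(word.startswith(MAINTENANCE_STEMS) for word in words)
```

For example, `is_maintenance_task("Remove unused helpers")` is true, while `is_maintenance_task("Refactor import handling")` is false because neither generic word is in the stem list.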

Eval-Set Caveats

The expanded public suite is useful, but it is not a perfect product-quality oracle.

  • The command prints 128 public cases in some runs, but the current scored release-target JSONL has 108 cases with expected_files. Report the scored count when citing recall and precision.
  • Expected files come from real commits, but labels are incomplete. Some selected "noise" is plausibly useful context, especially same-package, same-family, dependency, and test-adjacent files.
  • Added files must be filtered out when benchmarking the parent checkout, because AgentPack cannot select files that do not exist yet.
  • Token precision is harsher than file recall. A selected expected file can still contribute few expected tokens if the packed summary is too broad.
  • Vite and config/build tasks remain precision-sensitive. A change that improves Spring or MarkupSafe can still regress Vite because plausible source, playground, config, and test files compete for the same slots.
  • Slice wins are diagnostic, not release claims. The release claim needs a full public-suite run with exact recall and token precision.
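
To make the printed-versus-scored distinction concrete, averages should be taken only over cases that carry expected files. The following is a rough sketch under stated assumptions: `expected_files` is the field named above, while `recall` and `token_precision` are assumed key names for the per-case metrics in the JSONL.

```python
import json
from statistics import mean


def load_cases(path):
    """Read one JSON object per non-empty line from a benchmark JSONL file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def scored_summary(cases):
    """Average metrics over scored cases only (those with expected_files)."""
    scored = [case for case in cases if case.get("expected_files")]
    summary = {
        "printed_cases": len(cases),
        "scored_cases": len(scored),
        "avg_recall": None,
        "avg_token_precision": None,
    }
    if scored:
        summary["avg_recall"] = mean(c["recall"] for c in scored)
        summary["avg_token_precision"] = mean(c["token_precision"] for c in scored)
    return summary
```

Reporting `scored_cases` next to the averages avoids citing the larger printed count as the denominator.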

What Improved

Precision improved because the work separated the problem into ranking, selection, and packing failures instead of tuning one global score.

The most useful improvements were:

| Change | Why it helped |
| --- | --- |
| Literal definition matching | Quoted API names such as parseAst should prefer the defining/exporting file over call-site noise. |
| Multi-term path ranking | Tasks with several concrete path terms should reward files whose paths contain those terms, especially config files. |
| Conditional two-config cap | Low-budget strict packs can include two strongly matched config files without opening the door to generic config noise. |
| Package-root source detection | Monorepos often keep source under packages/<name>/... without a src/ segment. Those files need direct-source priority when evidence is strong. |
| Narrow root-Go strict support | Root Go source files with conventional-scope source evidence recovered Gin recall without expanding the summary cap. |
| Same-package paired-test overflow | Balanced no-live packs may include one extra packages/<name>/... test only when it directly tests an already selected source file and has direct content evidence. This recovered a NestJS expected test without a broad cap increase. |
| Same-playground test overflow | Balanced no-live packs may include one extra playground/<name>/... test only when the same playground already has selected context plus scope and phrase/content evidence. This recovered one Vite playground test with near-neutral token precision. |
| JVM build metadata signal | Java build/dependency tasks often expect root pom.xml or Gradle metadata. Root JVM build files now get a scoped boost. |
| Reason/family diagnostics | reason_family_precision, selected family waste, failure type counts, and low-budget last-file waste made tuning evidence-backed. |
| Parent-checkout expected-file filtering | Public history samples now exclude paths that do not exist in the parent checkout, so added files are not counted as selectable recall misses. |

What Failed

These experiments were rejected because they helped one slice while hurting the full suite:

| Rejected change | Failure mode |
| --- | --- |
| Broad content-only concrete boost | It over-selected noisy files with generic content hits and regressed TypeScript precision. |
| Treating .go files as release metadata | It pulled version.go into unrelated Gin tasks and hurt Go precision. |
| Broad explicit-test cap increase | It improved some recall but admitted too much test noise. |
| Specific-config pack suppression | It cleaned one Tailwind config case, but regressed CSS/config tasks that still needed source files. |
| Broad build metadata boost for pyproject.toml and package files | It fixed Spring-like tasks but hurt Python dependency/update cases, especially MarkupSafe. |
| Broad source strict-support exception | It recovered a few Go source files but admitted non-expected Java source and spent precision margin. Narrowing to the measured root-Go pattern kept the gain and removed the Java regression. |
| Compact-cost selection priority for JS/TS skeletons | It fixed one NestJS ordering issue, but reduced Vite precision by promoting plausible package tests over expected playground/config files. The safer retained rule is paired-test overflow only. |
| Weak package-test overflow | A Vite worker.spec.ts test looked locally related but had only 3 content hits and no task phrase for an import glob task. Package test overflow now requires stronger task evidence to avoid this slot waste. |

The rule from these failures: keep boosts scoped to the evidence family that proved the gain. Do not generalize from one benchmark case until the repo slice and full suite confirm it.

Benchmark Validity Guardrail

Public history cases run AgentPack against the parent of a real commit. A file added by that commit does not exist in the parent checkout, so AgentPack cannot select it. Sampled expected files must therefore be filtered to paths that exist in the parent commit.

This is a methodology correction, not a ranking improvement. It removes impossible EXPECTED_NOT_FOUND misses from sampled public cases while preserving modified/deleted files that are selectable from the parent checkout.
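
One way to apply this guardrail is to list the parent tree once and keep only expected paths that appear in it. The sketch below is an assumption about how such a filter could look, not AgentPack's implementation; the function names are hypothetical, and `commit^` selects only the first parent, so merge commits would need separate handling.

```python
import subprocess


def list_parent_paths(repo_dir, commit):
    """Return every file path present in the first parent of a commit."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "ls-tree", "-r", "--name-only", "-z", f"{commit}^"],
        check=True,
        capture_output=True,
        text=True,
    )
    # -z separates paths with NUL bytes and disables path quoting.
    return {path for path in result.stdout.split("\0") if path}


def filter_expected_files(expected_files, parent_paths):
    """Keep expected files that are selectable from the parent checkout.

    Modified and deleted files exist in the parent and are kept;
    files added by the commit are dropped.
    """
    return [path for path in expected_files if path in parent_paths]
```

Keeping the git call separate from the filter means the filter can be unit-tested with a plain set of paths.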

Static-List Guardrail

Static lists are acceptable only when they encode portable ecosystem conventions, not benchmark-case outcomes.

Acceptable examples:

| Convention | Why it is acceptable |
| --- | --- |
| Test filename forms such as _test.go, .spec.ts, and tests/ | These are language and framework conventions used outside the benchmark suite. |
| Root build metadata such as pom.xml, Gradle files, and package manifests | These are standard project files for dependency/build tasks. |
| File extensions and generated/example/test directory names | These describe broad file families, not one repo's expected answer. |

Risky examples:

| Shortcut | Why it is risky |
| --- | --- |
| Boosting a specific filename because one public case expected it | It can inflate one slice while hurting unrelated tasks. |
| Adding repo-shaped path rules such as a known package or fixture directory | It makes benchmark numbers less trustworthy and does not generalize. |
| Treating a language extension as task metadata without stronger evidence | It admits plausible-but-not-actionable files and wastes selection slots. |

New ranking rules should prefer dynamic evidence: task literals, symbol definitions, imports/calls, same-package locality, dependency edges, changed-file history, and measured reason-family precision. If a static convention is added, it needs a non-benchmark rationale plus negative tests that prove it does not revive known noisy families.

Slice Readout

The latest useful slice picture:

| Slice | Status |
| --- | --- |
| Python CLI/library | No longer the main precision bottleneck. Click and itsdangerous are healthy; MarkupSafe recovered after narrowing build metadata. |
| TypeScript/Vite | Same-playground test overflow moved the current Vite slice from 44.5% to 45.7% recall with token precision essentially neutral at 45.5% to 45.4%. Config/source ranking and ranked-low expected files remain the main blockers. |
| Go/Gin | Improved after root Go source and test-path handling. Latest targeted slice moved from 48.1% to 55.6% recall while token precision rose slightly from 46.5% to 47.3%. |
| Java/Spring | Strong gain from scoped JVM build metadata. |
| TypeScript monorepo/NestJS | Same-package paired-test overflow improved the current scored NestJS slice from 62.5% to 70.8% recall and from 40.5% to 44.0% token precision, but the slice remains below the 50% precision target. |

Benchmark Loop Guardrail

Full public-suite runs are expensive enough that experiments should not use them as the first diagnostic. Use this loop instead:

  1. Establish one full-suite baseline JSONL.
  2. Inspect case-level misses and choose the affected repo or task-type slices.
  3. Run only those slices with --public-repo-filter or --public-task-type-filter and write --benchmark-jsonl.
  4. Compare case-level recall, token precision, selected paths, and miss status.
  5. Run the full public suite once only after the slice result has a credible chance of improving the aggregate gate.

Example:

agentpack benchmark \
  --public-repos \
  --public-repo-filter gin,spring-petclinic \
  --public-repos-cache /tmp/agentpack-public-cache-full \
  --benchmark-jsonl /tmp/agentpack-go-java.jsonl \
  --misses
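
Step 4 of the loop, the case-level comparison, can be done by joining the baseline and slice records on a case identifier. This is a sketch under assumptions: the keys `case_id`, `recall`, `token_precision`, and `selected_paths` are hypothetical names for fields in the JSONL, and the input lists are assumed to be already parsed.

```python
def compare_cases(baseline, candidate):
    """Return per-case deltas for cases present in both runs.

    Rows are sorted so the largest recall regressions come first.
    """
    baseline_by_id = {case["case_id"]: case for case in baseline}
    rows = []
    for case in candidate:
        base = baseline_by_id.get(case["case_id"])
        if base is None:
            continue  # slice runs may include cases the baseline lacks
        before = set(base["selected_paths"])
        after = set(case["selected_paths"])
        rows.append(
            {
                "case_id": case["case_id"],
                "recall_delta": case["recall"] - base["recall"],
                "token_precision_delta": case["token_precision"] - base["token_precision"],
                "newly_selected": sorted(after - before),
                "dropped": sorted(before - after),
            }
        )
    rows.sort(key=lambda row: row["recall_delta"])
    return rows
```

Listing newly selected and dropped paths next to the deltas shows whether a recall gain came from the intended file or from an unrelated slot swap.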

Current Mental Model

Do not ask only whether recall went up. Ask where the expected file was lost:

  1. Candidate generation: was the expected file found at all?
  2. Ranking: did it reach the top candidates?
  3. Selection: did the pack choose it under budget and family caps?
  4. Packing: did selected tokens include useful expected-file content or mostly noise?

Useful recall is not just "the file appeared somewhere." It means the expected file appeared early, survived selection, and contributed enough useful tokens.
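
The four questions above map onto a simple first-failure classifier. The sketch below is one possible reading, with assumptions called out: the stage labels, the `top_k=50` cutoff, and the `min_coverage=0.5` threshold for "enough useful tokens" are illustrative values, not documented AgentPack settings.

```python
def classify_loss(
    expected_path,
    ranked_candidates,
    selected_paths,
    expected_token_coverage,
    top_k=50,
    min_coverage=0.5,
):
    """Return the first pipeline stage at which an expected file was lost.

    ranked_candidates is ordered best-first; expected_token_coverage is the
    share of the expected file's useful content that reached the pack.
    """
    if expected_path not in ranked_candidates:
        return "candidate_generation"
    if ranked_candidates.index(expected_path) >= top_k:
        return "ranking"
    if expected_path not in selected_paths:
        return "selection"
    if expected_token_coverage < min_coverage:
        return "packing"
    return "ok"
```

Because the checks run in pipeline order, a file reported as a packing loss is known to have passed discovery, ranking, and selection.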

Metrics To Keep Watching

These metrics were the most useful during tuning:

| Metric | Use |
| --- | --- |
| candidate_recall_at_50 | Separates discovery failures from ranking/selection failures. |
| candidate_precision_at_3 | Shows whether noisy files dominate the top of the ranked list. |
| token_precision | Main measure for packed-token usefulness. |
| expected_token_coverage | Approximates valuable recall for selected expected files. |
| selected_family_waste_tokens | Shows whether source, test, docs, config, fixtures, or generated files leak noise. |
| reason_family_precision | Shows which ranking reasons are trustworthy. |
| failure_type_counts | Splits misses into not found, ranked low, skipped, or noise selected above expected. |
| precision_delta_if_drop_last_summary | Identifies low-budget extra-file waste. |
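
For orientation, a few of these metrics can be approximated with short set and sum operations. The definitions below are assumptions, not the benchmark's actual formulas: in particular, token precision here counts every token of a selected expected file as useful, which is coarser than the real measure described in the caveats.

```python
def candidate_recall_at_k(ranked, expected, k=50):
    """Share of expected files that appear in the top k candidates."""
    expected = set(expected)
    if not expected:
        return 0.0
    return len(expected & set(ranked[:k])) / len(expected)


def candidate_precision_at_k(ranked, expected, k=3):
    """Share of the top k candidates that are expected files."""
    top = ranked[:k]
    if not top:
        return 0.0
    expected = set(expected)
    return sum(1 for path in top if path in expected) / len(top)


def token_precision(selected, expected):
    """File-level approximation: tokens from expected files / all tokens.

    selected is a list of (path, token_count) in selection order.
    """
    total = sum(tokens for _, tokens in selected)
    if total == 0:
        return 0.0
    expected = set(expected)
    useful = sum(tokens for path, tokens in selected if path in expected)
    return useful / total


def precision_delta_if_drop_last(selected, expected):
    """Change in token precision if the last selected file were removed."""
    if not selected:
        return 0.0
    return token_precision(selected[:-1], expected) - token_precision(selected, expected)
```

A large positive delta from the last helper suggests the final slot in a low-budget pack is being spent on noise.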

Next Optimization Areas

The 2026-06-13 local checkpoint clears the immediate release target:

| Metric | Target |
| --- | --- |
| Expanded public-suite recall | 65.0% (checkpoint: 66.0%) |
| Expanded public-suite token precision | 51.0% (checkpoint: 51.1%) |
| Major language/task-slice recall regression | <= 2 points |
| EXPECTED_NOT_FOUND sampled-public misses | 0 |

The next release should treat this as an achieved-but-thin margin, not as room for broad expansion. Token precision sits only 0.1 points above the 51% release target (and about one point above the 50% slice target), so the next changes should start with diagnostics and targeted slice validation.

The next precision/recall work should focus on these areas:

  1. Config/build intent recovery. config_build is the main remaining bottleneck: 48.8% recall and 35.0% token precision in the latest full run. Do not raise config caps broadly. First diagnose config-only, source-only, and config-plus-source tasks.

  2. Vite config/source balance. Some config tasks need only config files, while CSS/build tasks need both source and config. Selection needs conditional inclusion, not global suppression.

  3. NestJS wrong-package locality. Monorepo tasks need better package/workspace intent detection so packages/core and integration/* do not steal from each other unless the task evidence says so.

  4. Snippet/block packing. Several cases select the expected file but include too much surrounding noise. Function, class, config-section, and diff-hunk packing should improve token precision without sacrificing selected-file recall.

  5. Publish decision gates. Keep changes only when full-suite recall stays at or above 65%, token precision stays at or above 51%, and no major language/task slice regresses by more than 2 recall points.
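
The publish decision gate in item 5 can be written as a single check that also reports why a change was rejected. This is a minimal sketch, assuming metrics are passed as percentages and slices are keyed by name; the function name `keep_change` and the input shapes are hypothetical, while the three thresholds come from the text above.

```python
def keep_change(
    full_suite,
    baseline_slice_recall,
    candidate_slice_recall,
    recall_floor=65.0,
    token_precision_floor=51.0,
    max_slice_drop=2.0,
):
    """Return (keep, reasons) for a candidate change.

    full_suite holds avg_recall and avg_token_precision in percent.
    The slice arguments map slice name to recall in percent.
    """
    reasons = []
    if full_suite["avg_recall"] < recall_floor:
        reasons.append("full-suite recall below floor")
    if full_suite["avg_token_precision"] < token_precision_floor:
        reasons.append("full-suite token precision below floor")
    for name, before in baseline_slice_recall.items():
        after = candidate_slice_recall.get(name)
        if after is None:
            continue  # slice not re-run; cannot judge it here
        if before - after > max_slice_drop:
            reasons.append(f"slice {name} regressed by more than {max_slice_drop} points")
    return (not reasons, reasons)
```

Returning the reasons alongside the boolean makes a rejected experiment easy to log in the same style as the tables on this page.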