kernbench2/docs/adr/ADR-0054-eval-milestone-benches.md
mukesh b1d6fafd3a eval: commit milestone bench output (track generated figures + results)
Per request, the milestone bench output is now tracked in git instead of
gitignored, so the figures/results are viewable on the remote:

- src/kernbench/benches/1H_milestone_output/gemm/  (3 PNGs + gemm_sweep.json)
- src/kernbench/benches/1H_milestone_output/ccl/   (3 per-topology PNGs,
  buffer-kind PNG+CSV, FSIM comparison PNG, topology.png, summary.csv)

Drop the .gitignore rule; update ADR-0054 D3 + Negative (EN+KO) to say the
output is committed (regenerable by rerunning the bench). Artifacts produced
by full bench runs (milestone-1h-gemm non-FAST, milestone-1h-ccl).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:37:27 -07:00


ADR-0054: Milestone Eval Benches — self-contained sweep + figure benches

Status

Accepted (2026-05-22).

Amends ADR-0044 (D1/D2) and ADR-0045 (D5) and supersedes the "logic lives in scripts/ + tests/" arrangement of ADR-0043/0044: the GEMM and allreduce evaluation harnesses are now self-contained benches that a user runs to regenerate every result + figure.

Context

ADR-0043 (allreduce eval) and ADR-0044 (GEMM eval) split each harness into a sweep (a manual scripts/ driver, or — for allreduce — the parametrized tests themselves) plus figure tests that render committed data. The sweep/render logic therefore lived under scripts/gemm_sweep.py, tests/gemm/_gemm_plot_helpers.py, and tests/sccl/_allreduce_helpers.py.

A milestone requirement ("refactor allreduce + GEMM evaluation so a user can run one bench to generate all the results and plots") cannot be met by that layout: a bench is production code and must not import from tests/ (ADR-0007 layer direction). The eval logic had to move into production, reachable from a bench.

The chosen home is the bench module itself — not a separate kernbench.eval package. A bench file may contain arbitrary module-level code; collapsing the harness into the bench keeps one file per domain and avoids an extra package layer.

Decision

D1. Two milestone benches own the eval logic

  • src/kernbench/benches/milestone_1h_gemm.py — GEMM shape×variant sweep + the three figure renderers (moved from scripts/gemm_sweep.py + tests/gemm/_gemm_plot_helpers.py).
  • src/kernbench/benches/milestone_1h_ccl.py — the distributed allreduce driver, latency + buffer-kind sweeps, topology diagram, FSIM comparison, and the direct-launch parity reference (moved from tests/sccl/_allreduce_helpers.py).

Each file is the single home for its domain's eval logic.

D2. The "eval bench" pattern (extends ADR-0045 D5)

ADR-0045 D5 fixed a bench to a single configuration (single-SIP, or the ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:

  • An eval bench may drive many configurations and render figures. It builds its own per-config GraphEngine / RuntimeContext instances (one per sweep point) rather than using the outer run_bench engine.
  • Because the outer ctx then has no submitted handles, the bench submits a sentinel tensor (torch.zeros((1, 1), …)) at the end to satisfy run_bench's "must submit at least one request" contract (ADR-0045 D4), so the CLI exits 0.
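The two bullets above can be sketched as follows. This is a minimal illustration, not the bench's code: `StubContext`, `simulate`, and `run_eval_bench` are hypothetical stand-ins for the real RuntimeContext / GraphEngine machinery, and a plain tuple stands in for the `torch.zeros((1, 1), …)` sentinel so the sketch runs without torch.

```python
from dataclasses import dataclass, field


@dataclass
class StubContext:
    """Hypothetical stand-in for a RuntimeContext; it only records submissions."""
    submitted: list = field(default_factory=list)

    def submit(self, request):
        self.submitted.append(request)


def simulate(config, ctx):
    """Hypothetical per-config run: submits one request, returns a placeholder metric."""
    ctx.submit(("gemm", config))
    m, n, k = config
    return m * n * k  # placeholder value, not a real measurement


def run_eval_bench(outer_ctx, configs):
    """Drive every sweep point on its own context, then submit a sentinel."""
    results = {}
    for config in configs:
        inner_ctx = StubContext()  # fresh context per sweep point
        results[config] = simulate(config, inner_ctx)
    # Nothing has been submitted on the outer context so far, so a sentinel
    # request is added to satisfy the "at least one request" contract.
    outer_ctx.submit(("sentinel", (1, 1)))
    return results


outer = StubContext()
results = run_eval_bench(outer, [(64, 64, 64), (128, 128, 64)])
print(len(results), len(outer.submitted))
```

The point of the sketch is the ownership split: all real work lands on the per-config contexts, and the outer context sees exactly one (sentinel) submission.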

D3. Output location

Both benches write to src/kernbench/benches/1H_milestone_output/{gemm,ccl}/ (per user request — artifacts beside the bench). The directory holds only generated PNG/CSV/JSON (never a .py/__init__.py), so the eager-import audit (ADR-0045 first action) ignores it — pkgutil.iter_modules does not yield non-package subdirectories. It is committed (like the docs/diagrams/ artifacts) so the figures are viewable on the remote; rerunning the bench regenerates it in place.
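The `pkgutil.iter_modules` behaviour this decision relies on can be checked in isolation. The sketch below builds a throwaway directory that mimics the layout (a bench module next to a data-only output directory); the helper name `discovered_modules` and the temporary layout are illustrative assumptions, and the real audit may do more than enumerate modules.

```python
import pkgutil
import tempfile
from pathlib import Path


def discovered_modules(package_dir):
    """Return the sorted names pkgutil reports directly under package_dir."""
    return sorted(info.name for info in pkgutil.iter_modules([str(package_dir)]))


with tempfile.TemporaryDirectory() as tmp:
    benches = Path(tmp) / "benches"
    benches.mkdir()
    # A bench module, as in the flat layout.
    (benches / "milestone_1h_gemm.py").write_text("# bench module\n")
    # A data-only output directory: no .py files and no __init__.py.
    out = benches / "1H_milestone_output" / "gemm"
    out.mkdir(parents=True)
    (out / "gemm_sweep.json").write_text("{}")
    names = discovered_modules(benches)

print(names)  # ['milestone_1h_gemm']
```

The output directory is skipped because it contains no `__init__` module, which is why the invariant "never a .py/__init__.py in the output directory" matters.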

D4. GEMM heavy sweep — fresh by default, MILESTONE_FAST to reuse

milestone-1h-gemm runs the full 24-sim sweep by default (minutes; one shape is 2048 tiles). MILESTONE_FAST=1 reuses the committed docs/diagrams/gemm_sweep.json and only re-renders (seconds). This reverses ADR-0044 D1/D2's "heavy sweep stays a manual/slow-marked step": running the bench is the regeneration. In the test suite the defaults are the other way around: the full sweep is exercised only by a @pytest.mark.slow bench test, while the MILESTONE_FAST path is what the default (non-slow) test run covers.
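A minimal sketch of the fresh-by-default switch, under stated assumptions: `load_or_run_sweep` and `fake_sweep` are hypothetical names, only the literal value `1` is treated as enabling the fast path, and the real bench may cache or validate differently.

```python
import json
import os
import tempfile
from pathlib import Path


def load_or_run_sweep(cache_path, run_sweep, env=None):
    """Run the sweep unless MILESTONE_FAST=1, in which case reuse cache_path."""
    env = os.environ if env is None else env
    if env.get("MILESTONE_FAST") == "1":
        # Fast path: reuse the committed results and skip simulation.
        return json.loads(Path(cache_path).read_text())
    # Default path: regenerate from scratch.
    return run_sweep()


calls = []


def fake_sweep():
    """Hypothetical stand-in for the heavy 24-sim sweep."""
    calls.append("ran")
    return {"points": 24}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "gemm_sweep.json"
    cache.write_text(json.dumps({"points": 24, "cached": True}))
    fresh = load_or_run_sweep(cache, fake_sweep, env={})
    fast = load_or_run_sweep(cache, fake_sweep, env={"MILESTONE_FAST": "1"})

print(fresh, fast, calls)
```

Passing `env` explicitly keeps the sketch testable without mutating the process environment; a real bench would more likely read `os.environ` directly.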

D5. Tests + script reuse via thin re-export shims (single home kept)

The pre-existing figure tests and the scripts/gemm_sweep.py entry point are retained and now reuse the bench modules:

  • tests/gemm/_gemm_plot_helpers.py → re-exports the renderers + GEMM_SWEEP_JSON/GEMM_PLOTS_DIR/ROOT from kernbench.benches.milestone_1h_gemm.
  • tests/sccl/_allreduce_helpers.py → re-exports the driver core, config writers, sweep constants, renderers, and disk aggregators from kernbench.benches.milestone_1h_ccl, and keeps the pytest-only pieces local: the pytest.param matrices (CONFIGS / _sweep_params / _bk_params) and the fixture-coupled _run_distributed (monkeypatch.chdir + _drive_distributed) wrapper.
  • scripts/gemm_sweep.py → thin wrapper over the bench's run_sweep.

A test may import a bench module (tests sit above production, ADR-0007). Doing so triggers the whole-package eager audit, which already runs on every kernbench invocation. matplotlib stays lazily imported inside the renderers, so the audit's startup cost is unchanged.
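The shim idea can be illustrated without the real packages. In the sketch below, `fake_milestone_bench` and `fake_plot_helpers` are hypothetical in-memory modules standing in for the bench module and the test helper; the shim body is nothing but an import statement, so the names it exposes are the very same objects the bench defines.

```python
import sys
import types

# Stand-in for the production bench module (hypothetical contents).
bench = types.ModuleType("fake_milestone_bench")
bench.GEMM_SWEEP_JSON = "gemm_sweep.json"


def render_figures(data):
    """Hypothetical renderer: returns the file names it would write."""
    return [f"{name}.png" for name in sorted(data)]


bench.render_figures = render_figures
sys.modules[bench.__name__] = bench

# The shim: no logic of its own, only re-exports from the bench.
shim_source = (
    "from fake_milestone_bench import GEMM_SWEEP_JSON, render_figures\n"
    "__all__ = ['GEMM_SWEEP_JSON', 'render_figures']\n"
)
shim = types.ModuleType("fake_plot_helpers")
exec(shim_source, shim.__dict__)

# Callers of the shim get the bench's own objects, not copies.
print(shim.render_figures is bench.render_figures)
print(shim.render_figures({"roofline": 1, "heatmap": 2}))
```

Because the shim holds references rather than duplicated code, a fix in the bench is picked up by every test and script that goes through the shim.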

D6. Flat module naming (no benches/ subfolder)

A benches/ subpackage named 1H_milestone… is impossible — a Python package name cannot start with a digit. The benches are therefore flat modules milestone_1h_gemm.py / milestone_1h_ccl.py with bench names milestone-1h-gemm / milestone-1h-ccl (kebab-case, letter-first per ADR-0045 D1).
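The naming constraint can be checked with `str.isidentifier`. In the sketch below, the regular expression is one assumed reading of "kebab-case, letter-first" (the actual ADR-0045 D1 rule may differ), and `module_name_for` with its dash-to-underscore mapping is a hypothetical convention inferred from the names above.

```python
import re

# Assumed reading of "kebab-case, letter-first"; not taken from ADR-0045.
BENCH_NAME_RE = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")


def module_name_for(bench_name):
    """Map a kebab-case bench name to a flat module name (hypothetical mapping)."""
    if not BENCH_NAME_RE.fullmatch(bench_name):
        raise ValueError(f"not a letter-first kebab-case name: {bench_name!r}")
    module = bench_name.replace("-", "_")
    if not module.isidentifier():
        raise ValueError(f"not importable as a module name: {module!r}")
    return module


# A digit-first name is not a valid identifier, so it cannot appear
# in an import statement.
print("1H_milestone".isidentifier())        # False
print(module_name_for("milestone-1h-gemm"))  # milestone_1h_gemm
print(module_name_for("milestone-1h-ccl"))   # milestone_1h_ccl
```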

Consequences

Positive

  • kernbench run --bench milestone-1h-gemm (or …-ccl) regenerates all of a domain's results + figures in one command — the milestone requirement.
  • Single source for the eval logic (the bench), reused by tests and the script via shims; no duplication.
  • The figure tests and scripts/gemm_sweep.py keep working unchanged.

Negative / limitations

  • The two bench files are large (the CCL one mixes the distributed driver, sweeps, and matplotlib drawing). A "bench" that is mostly an eval harness is unusual; this ADR legitimizes it.
  • Generated artifacts live inside the source tree (src/kernbench/benches/) by explicit request and are committed (so the figures are viewable on the remote); rerunning the bench regenerates them.
  • milestone-1h-ccl (and the default milestone-1h-gemm) take minutes — acceptable for an on-demand milestone artifact, not for routine runs.

Dependencies

  • ADR-0007: layer direction (why tests may import production but a bench may not import tests).
  • ADR-0043 / ADR-0044: the allreduce / GEMM eval harnesses this ADR relocates into benches.
  • ADR-0045: bench module contract; D2 here extends its D5 (single-device rule) with the eval-bench pattern, and relies on D4 (NO_REQUESTS) for the sentinel.
  • ADR-0024: rank = SIP launcher driven by the allreduce sweeps.

Open questions

  • Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from ADR-0033/0014 rather than copied? Unchanged by this ADR.
  • Should build_overview_slides.py consume the milestone output PNGs instead of drawing GEMM bars natively? Still open (ADR-0044 D6 / Negative).