# ADR-0054: Milestone Eval Benches (self-contained sweep + figure benches)

## Status

Accepted (2026-05-22). Amends ADR-0044 (D1/D2) and ADR-0045 (D5) and supersedes the "logic lives in `scripts/` + `tests/`" arrangement of ADR-0043/0044: the GEMM and allreduce evaluation harnesses are now self-contained **benches** that a user runs to regenerate every result + figure.

## Context

ADR-0043 (allreduce eval) and ADR-0044 (GEMM eval) split each harness into a **sweep** (a manual `scripts/` driver or, for allreduce, the parametrized tests themselves) plus **figure tests** that render committed data. The sweep/render logic therefore lived under `scripts/gemm_sweep.py`, `tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.

A milestone requirement ("refactor allreduce + GEMM evaluation so a user can run *one bench* to generate all the results and plots") cannot be met by that layout: a bench is production code and **must not import from `tests/`** (ADR-0007 layer direction). The eval logic had to move into production, reachable from a bench.

The chosen home is the bench module itself, not a separate `kernbench.eval` package. A bench file may contain arbitrary module-level code; collapsing the harness into the bench keeps one file per domain and avoids an extra package layer.

## Decision

### D1. Two milestone benches own the eval logic

- `src/kernbench/benches/milestone_1h_gemm.py`: GEMM shape×variant sweep + the three figure renderers (moved from `scripts/gemm_sweep.py` + `tests/gemm/_gemm_plot_helpers.py`).
- `src/kernbench/benches/milestone_1h_ccl.py`: the distributed allreduce driver, latency + buffer-kind sweeps, topology diagram, FSIM comparison, and the direct-launch parity reference (moved from `tests/sccl/_allreduce_helpers.py`).

Each file is the **single home** for its domain's eval logic.

### D2. The "eval bench" pattern (extends ADR-0045 D5)

ADR-0045 D5 fixed a bench to a single configuration (single-SIP, or the ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:

- An **eval bench** may drive *many* configurations and render figures. It builds its own per-config `GraphEngine` / `RuntimeContext` instances (one per sweep point) rather than using the outer `run_bench` engine.
- Because the outer ctx then has no submitted handles, the bench submits a **sentinel tensor** (`torch.zeros((1, 1), …)`) at the end to satisfy `run_bench`'s "must submit at least one request" contract (ADR-0045 D4), so the CLI exits 0.

### D3. Output location

Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/` (per user request: artifacts beside the bench). The directory holds only generated PNG/CSV/JSON (never a `.py`/`__init__.py`), so the eager-import audit (ADR-0045 first action) ignores it: `pkgutil.iter_modules` does not yield non-package subdirectories. It is **git-ignored** (regenerable on demand), unlike the committed `docs/diagrams/` artifacts.

### D4. GEMM heavy sweep: fresh by default, `MILESTONE_FAST` to reuse

`milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one shape is 2048 tiles). `MILESTONE_FAST=1` reuses the committed `docs/diagrams/gemm_sweep.json` and only re-renders (seconds). This reverses ADR-0044 D1/D2's "heavy sweep stays a manual/`slow`-marked step": running the bench *is* the regeneration. In the test suite, the slow path is exercised by a `@pytest.mark.slow` bench test, while the fast path runs by default.

### D5. Tests + script reuse via thin re-export shims (single home kept)

The pre-existing figure tests and the `scripts/gemm_sweep.py` entry point are retained and now reuse the bench modules:

- `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers + `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from `kernbench.benches.milestone_1h_gemm`.
- `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config writers, sweep constants, renderers, and disk aggregators from `kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` / `_bk_params`) and the fixture-coupled `_run_distributed` (`monkeypatch.chdir` + `_drive_distributed`) wrapper.
- `scripts/gemm_sweep.py` → thin wrapper over the bench's `run_sweep`.

Tests may import a bench module (tests sit above production, ADR-0007); doing so triggers the whole-package eager audit, which already runs on every `kernbench` invocation. matplotlib stays lazily imported inside the renderers, so the audit's startup cost is unchanged.

### D6. Flat module naming (no `benches/` subfolder)

A `benches/` subpackage named `1H_milestone…` is impossible: a Python package name cannot start with a digit. The benches are therefore flat modules `milestone_1h_gemm.py` / `milestone_1h_ccl.py` with bench names `milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, letter-first per ADR-0045 D1).

## Consequences

### Positive

- `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of a domain's results + figures in one command, which is the milestone requirement.
- Single source for the eval logic (the bench), reused by tests and the script via shims; no duplication.
- The figure tests and `scripts/gemm_sweep.py` keep working unchanged.

### Negative / limitations

- The two bench files are large (the CCL one mixes the distributed driver, sweeps, and matplotlib drawing). A "bench" that is mostly an eval harness is unusual; this ADR legitimizes it.
- Generated artifacts live inside the source tree (`src/kernbench/benches/`) by explicit request; they are git-ignored to avoid committing them.
- `milestone-1h-ccl` (and the default `milestone-1h-gemm`) take minutes, which is acceptable for an on-demand milestone artifact but not for routine runs.
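
The minutes-versus-seconds trade-off in the last bullet comes from the D4 switch. A minimal sketch of that switch follows, assuming the environment variable is compared against the literal string `"1"`. The function name `load_or_run_sweep`, its parameters, and the returned mode label are hypothetical and are not taken from the bench module; the behaviour when the JSON file is missing in fast mode is also an assumption (here the read simply raises).

```python
import json
import os
from pathlib import Path
from typing import Callable, Mapping, Optional


def load_or_run_sweep(
    sweep_json: Path,
    run_sweep: Callable[[], list],
    env: Optional[Mapping[str, str]] = None,
) -> tuple[list, str]:
    """Return (results, mode), where mode is "fast" or "fresh".

    Hypothetical helper: the real bench may structure this differently.
    """
    env = os.environ if env is None else env
    if env.get("MILESTONE_FAST") == "1":
        # Fast path: reuse previously committed sweep data and only re-render.
        return json.loads(sweep_json.read_text()), "fast"
    # Default path: run the full sweep from scratch.
    return run_sweep(), "fresh"
```

Passing `env` explicitly keeps the helper easy to test without mutating `os.environ`; a caller in the bench would omit it and let the process environment decide.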
## Dependencies

- **ADR-0007**: layer direction (why tests may import production but a bench may not import tests).
- **ADR-0043 / ADR-0044**: the allreduce / GEMM eval harnesses this ADR relocates into benches.
- **ADR-0045**: bench module contract; D2 here extends its D5 (single-device rule) with the eval-bench pattern, and relies on D4 (NO_REQUESTS) for the sentinel.
- **ADR-0024**: rank = SIP launcher driven by the allreduce sweeps.

## Open questions

- Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from ADR-0033/0014 rather than copied? Unchanged by this ADR.
- Should `build_overview_slides.py` consume the milestone output PNGs instead of drawing GEMM bars natively? Still open (ADR-0044 D6 / Negative).
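
As a supporting check for D3, the claim that `pkgutil.iter_modules` skips a data-only output directory can be reproduced in isolation. The sketch below builds a throwaway directory that mimics the bench layout; the artifact file names (`roofline.png`, `sweep.csv`) and the helper names are hypothetical, and it does not reproduce the actual ADR-0045 audit, only the standard-library behaviour the audit is said to rely on.

```python
import pkgutil
import tempfile
from pathlib import Path


def discoverable_modules(package_dir: Path) -> list[str]:
    """Names that pkgutil.iter_modules reports directly under package_dir."""
    return sorted(info.name for info in pkgutil.iter_modules([str(package_dir)]))


def demo() -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        benches = Path(tmp) / "benches"
        benches.mkdir()
        (benches / "__init__.py").write_text("")
        # Two flat bench modules, named as in D6.
        (benches / "milestone_1h_gemm.py").write_text("")
        (benches / "milestone_1h_ccl.py").write_text("")
        # Output directory holding only generated data, no __init__.py.
        out = benches / "1H_milestone_output" / "gemm"
        out.mkdir(parents=True)
        (out / "roofline.png").write_bytes(b"")  # hypothetical artifact name
        (out / "sweep.csv").write_text("")       # hypothetical artifact name
        return discoverable_modules(benches)


print(demo())
```

Only the two bench modules are listed; the output directory is not, because it contains no `__init__.py`.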