kernbench2/docs/adr/ADR-0054-eval-milestone-benches.md
mukesh b1d6fafd3a eval: commit milestone bench output (track generated figures + results)
Per request, the milestone bench output is now tracked in git instead of
gitignored, so the figures/results are viewable on the remote:

- src/kernbench/benches/1H_milestone_output/gemm/  (3 PNGs + gemm_sweep.json)
- src/kernbench/benches/1H_milestone_output/ccl/   (3 per-topology PNGs,
  buffer-kind PNG+CSV, FSIM comparison PNG, topology.png, summary.csv)

Drop the .gitignore rule; update ADR-0054 D3 + Negative (EN+KO) to say the
output is committed (regenerable by rerunning the bench). Artifacts produced
by full bench runs (milestone-1h-gemm non-FAST, milestone-1h-ccl).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:37:27 -07:00

# ADR-0054: Milestone Eval Benches — self-contained sweep + figure benches
## Status
Accepted (2026-05-22).
Amends ADR-0044 (D1/D2) and ADR-0045 (D5) and supersedes the "logic lives
in `scripts/` + `tests/`" arrangement of ADR-0043/0044: the GEMM and
allreduce evaluation harnesses are now self-contained **benches** that a
user runs to regenerate every result + figure.
## Context
ADR-0043 (allreduce eval) and ADR-0044 (GEMM eval) split each harness into
a **sweep** (a manual `scripts/` driver, or — for allreduce — the
parametrized tests themselves) plus **figure tests** that render committed
data. The sweep/render logic therefore lived under `scripts/gemm_sweep.py`,
`tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.
A milestone requirement ("refactor allreduce + GEMM evaluation so a user
can run *one bench* to generate all the results and plots") cannot be met
by that layout: a bench is production code and **must not import from
`tests/`** (ADR-0007 layer direction). The eval logic had to move into
production, reachable from a bench.
The chosen home is the bench module itself — not a separate
`kernbench.eval` package. A bench file may contain arbitrary module-level
code; collapsing the harness into the bench keeps one file per domain and
avoids an extra package layer.
## Decision
### D1. Two milestone benches own the eval logic
- `src/kernbench/benches/milestone_1h_gemm.py` — GEMM shape×variant sweep +
the three figure renderers (moved from `scripts/gemm_sweep.py` +
`tests/gemm/_gemm_plot_helpers.py`).
- `src/kernbench/benches/milestone_1h_ccl.py` — the distributed allreduce
driver, latency + buffer-kind sweeps, topology diagram, FSIM comparison,
and the direct-launch parity reference (moved from
`tests/sccl/_allreduce_helpers.py`).
Each file is the **single home** for its domain's eval logic.
### D2. The "eval bench" pattern (extends ADR-0045 D5)
ADR-0045 D5 fixed a bench to a single configuration (single-SIP, or the
ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:
- An **eval bench** may drive *many* configurations and render figures. It
builds its own per-config `GraphEngine` / `RuntimeContext` instances
(one per sweep point) rather than using the outer `run_bench` engine.
- Because the outer ctx then has no submitted handles, the bench submits a
**sentinel tensor** (`torch.zeros((1, 1), …)`) at the end to satisfy
`run_bench`'s "must submit at least one request" contract (ADR-0045 D4),
so the CLI exits 0.
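The sentinel step can be illustrated with a toy model. The sketch below is not kernbench code: `ToyContext`, `toy_run_bench`, `eval_bench`, and the config names are hypothetical stand-ins, and a plain tuple replaces the `torch.zeros((1, 1), …)` tensor so the example runs without torch. It only shows why an eval bench that does all its work on private per-config contexts still needs one submission on the outer context.

```python
# Toy model of the eval-bench pattern. Every name here is a hypothetical
# stand-in, not the kernbench API.

class ToyContext:
    """Records submitted requests, as a runtime context might."""

    def __init__(self):
        self.requests = []

    def submit(self, payload):
        self.requests.append(payload)


def toy_run_bench(bench):
    """Stand-in runner that rejects a bench which submits nothing."""
    outer = ToyContext()
    bench(outer)
    if not outer.requests:
        raise RuntimeError("NO_REQUESTS: bench submitted nothing")
    return 0  # process exit code


def eval_bench(outer, submit_sentinel=True):
    results = {}
    for config in ("config-a", "config-b", "config-c"):  # illustrative sweep
        inner = ToyContext()          # one private context per sweep point
        inner.submit(config)
        results[config] = len(inner.requests)
    if submit_sentinel:
        outer.submit((0.0,))          # placeholder for a 1x1 zero tensor
    return results


exit_code = toy_run_bench(eval_bench)
```

Without the final `outer.submit(...)`, the outer context stays empty and the stand-in runner raises, which mirrors the contract the sentinel is meant to satisfy.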
### D3. Output location
Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
(per user request — artifacts beside the bench). The directory holds only
generated PNG/CSV/JSON (never a `.py`/`__init__.py`), so the eager-import
audit (ADR-0045 first action) ignores it — `pkgutil.iter_modules` does not
yield non-package subdirectories. It is **committed** (like the
`docs/diagrams/` artifacts) so the figures are viewable on the remote;
rerunning the bench regenerates it in place.
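The `pkgutil.iter_modules` behaviour that D3 relies on can be checked in isolation. The sketch below builds a throwaway directory that imitates the layout described above (the temporary paths and empty files are illustrative, not the real repository) and lists what `iter_modules` reports.

```python
import pkgutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    benches = Path(tmp) / "benches"
    benches.mkdir()
    (benches / "__init__.py").write_text("")
    (benches / "milestone_1h_gemm.py").write_text("")

    # Output directory with data files only: no .py, no __init__.py.
    out = benches / "1H_milestone_output" / "gemm"
    out.mkdir(parents=True)
    (out / "gemm_sweep.json").write_text("{}")

    found = sorted(m.name for m in pkgutil.iter_modules([str(benches)]))

print(found)  # ['milestone_1h_gemm']
```

The data-only directory is skipped because it contains no `__init__` module, so an audit built on `iter_modules` never tries to import it.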
### D4. GEMM heavy sweep — fresh by default, `MILESTONE_FAST` to reuse
`milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one
shape is 2048 tiles). `MILESTONE_FAST=1` reuses the committed
`docs/diagrams/gemm_sweep.json` and only re-renders (seconds). This
reverses ADR-0044 D1/D2's "heavy sweep stays a manual/`slow`-marked step":
running the bench *is* the regeneration. In the test suite, the slow path is
exercised by a `@pytest.mark.slow` bench test, while the fast-path test runs by default.
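A minimal sketch of the `MILESTONE_FAST` switch, under stated assumptions: `load_or_run_sweep` and its parameters are hypothetical names, and two behaviours shown here are guesses rather than documented ones (a fresh sweep writing its results back to the JSON file, and falling back to a fresh sweep when the cached file is missing).

```python
import json
import os
from pathlib import Path


def load_or_run_sweep(cache_path, run_sweep, env=None):
    """Return sweep results, reusing cached JSON only when MILESTONE_FAST=1.

    Hypothetical helper: `run_sweep` is any zero-argument callable that
    returns JSON-serializable results.
    """
    env = os.environ if env is None else env
    cache_path = Path(cache_path)

    # Fast path: reuse the stored results and skip the simulations.
    if env.get("MILESTONE_FAST") == "1" and cache_path.exists():
        return json.loads(cache_path.read_text())

    # Default path: run the full sweep, then store it (assumed behaviour).
    results = run_sweep()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(results, indent=2))
    return results
```

Passing `env` explicitly keeps the helper easy to test without mutating the process environment; the renderers would consume the returned results on either path.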
### D5. Tests + script reuse via thin re-export shims (single home kept)
The pre-existing figure tests and the `scripts/gemm_sweep.py` entry point
are retained and now reuse the bench modules:
- `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
`GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
`kernbench.benches.milestone_1h_gemm`.
- `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
writers, sweep constants, renderers, and disk aggregators from
`kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
`_bk_params`) and the fixture-coupled `_run_distributed`
(`monkeypatch.chdir` + `_drive_distributed`) wrapper.
- `scripts/gemm_sweep.py` → thin wrapper over the bench's `run_sweep`.
Tests importing a bench module is permitted (tests sit above production,
ADR-0007); it triggers the whole-package eager audit, which already runs on
every `kernbench` invocation. matplotlib stays lazily imported inside the
renderers, so the audit's startup cost is unchanged.
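The shim idea can be shown with stand-in modules. In the sketch below, `toy_bench` plays the role of the bench module and `toy_plot_helpers` the role of a test-side shim; both module names and the `render_figures` function are hypothetical, and the modules are built in memory only so the example runs without the kernbench package. A real shim would simply be a file containing the `from … import …` line.

```python
import sys
import types

# Stand-in for the production bench module (hypothetical contents).
bench = types.ModuleType("toy_bench")
bench.GEMM_SWEEP_JSON = "gemm_sweep.json"
bench.render_figures = lambda data: f"rendered {len(data)} points"
sys.modules["toy_bench"] = bench

# Stand-in for the test-side shim: it defines nothing of its own and only
# re-exports names from the bench module.
shim = types.ModuleType("toy_plot_helpers")
exec("from toy_bench import GEMM_SWEEP_JSON, render_figures", shim.__dict__)
sys.modules["toy_plot_helpers"] = shim

# Existing callers keep their old import path...
from toy_plot_helpers import render_figures

# ...but receive the very same object that lives in the bench module.
assert render_figures is bench.render_figures
```

Because the shim re-exports rather than copies, a change to the bench's renderer is picked up by every caller of the old helper path.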
### D6. Flat module naming (no `benches/` subfolder)
A `benches/` subpackage named `1H_milestone…` cannot be imported with a normal
`import` statement, because a package name must be a valid identifier and
identifiers cannot start with a digit. The benches are therefore flat
modules `milestone_1h_gemm.py` / `milestone_1h_ccl.py` with bench names
`milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, letter-first per
ADR-0045 D1).
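The naming constraint can be checked with the standard library. In this sketch, `"1H_milestone"` stands in for the elided subpackage name, and both the `KEBAB` pattern and the `bench_name` mapping are assumptions for illustration; the exact ADR-0045 D1 rule is not reproduced here.

```python
import re

# Hypothetical letter-first kebab-case check (not the ADR-0045 D1 text).
KEBAB = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")


def bench_name(module_name):
    """Assumed mapping from a flat module name to its bench name."""
    return module_name.replace("_", "-")


print("1H_milestone".isidentifier())                # False: leading digit
print("milestone_1h_gemm".isidentifier())           # True
print(bench_name("milestone_1h_gemm"))              # milestone-1h-gemm
print(bool(KEBAB.fullmatch("milestone-1h-gemm")))   # True
```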
## Consequences
### Positive
- `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of
a domain's results + figures in one command — the milestone requirement.
- Single source for the eval logic (the bench), reused by tests and the
script via shims; no duplication.
- The figure tests and `scripts/gemm_sweep.py` keep working unchanged.
### Negative / limitations
- The two bench files are large (the CCL one mixes the distributed driver,
sweeps, and matplotlib drawing). A "bench" that is mostly an eval harness
is unusual; this ADR legitimizes it.
- Generated artifacts live inside the source tree (`src/kernbench/benches/`)
by explicit request and are committed (so the figures are viewable on the
remote); rerunning the bench regenerates them.
- `milestone-1h-ccl` (and the default `milestone-1h-gemm`) take minutes —
acceptable for an on-demand milestone artifact, not for routine runs.
## Dependencies
- **ADR-0007**: layer direction (why tests may import production but a bench
may not import tests).
- **ADR-0043 / ADR-0044**: the allreduce / GEMM eval harnesses this ADR
relocates into benches.
- **ADR-0045**: bench module contract; D2 here extends its D5 (single-device
rule) with the eval-bench pattern, and relies on D4 (NO_REQUESTS) for the
sentinel.
- **ADR-0024**: rank = SIP launcher driven by the allreduce sweeps.
## Open questions
- Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from
ADR-0033/0014 rather than copied? Unchanged by this ADR.
- Should `build_overview_slides.py` consume the milestone output PNGs
instead of drawing GEMM bars natively? Still open (ADR-0044 D6 / Negative).