kernbench2/docs/adr/ADR-0054-eval-milestone-benches.md
mukesh cc1bbd0ab7 eval: fold GEMM/allreduce harnesses into self-contained milestone benches
Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command:

  kernbench run --bench milestone-1h-gemm   (MILESTONE_FAST=1 reuses JSON)
  kernbench run --bench milestone-1h-ccl

- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
  run(torch) entry drives the sweeps and writes figures into
  benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
  sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
  re-export/wrapper shims over the benches (single source preserved); the
  pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
  per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).

ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated. Verified: milestone benches run
clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:19:52 -07:00

# ADR-0054: Milestone Eval Benches — self-contained sweep + figure benches
## Status
Accepted (2026-05-22).
Amends ADR-0044 (D1/D2) and ADR-0045 (D5) and supersedes the "logic lives
in `scripts/` + `tests/`" arrangement of ADR-0043/0044: the GEMM and
allreduce evaluation harnesses are now self-contained **benches** that a
user runs to regenerate every result + figure.
## Context
ADR-0043 (allreduce eval) and ADR-0044 (GEMM eval) split each harness into
a **sweep** (a manual `scripts/` driver, or — for allreduce — the
parametrized tests themselves) plus **figure tests** that render committed
data. The sweep/render logic therefore lived under `scripts/gemm_sweep.py`,
`tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.
A milestone requirement ("refactor allreduce + GEMM evaluation so a user
can run *one bench* to generate all the results and plots") cannot be met
by that layout: a bench is production code and **must not import from
`tests/`** (ADR-0007 layer direction). The eval logic had to move into
production, reachable from a bench.
The chosen home is the bench module itself — not a separate
`kernbench.eval` package. A bench file may contain arbitrary module-level
code; collapsing the harness into the bench keeps one file per domain and
avoids an extra package layer.
## Decision
### D1. Two milestone benches own the eval logic
- `src/kernbench/benches/milestone_1h_gemm.py` — GEMM shape×variant sweep +
the three figure renderers (moved from `scripts/gemm_sweep.py` +
`tests/gemm/_gemm_plot_helpers.py`).
- `src/kernbench/benches/milestone_1h_ccl.py` — the distributed allreduce
driver, latency + buffer-kind sweeps, topology diagram, FSIM comparison,
and the direct-launch parity reference (moved from
`tests/sccl/_allreduce_helpers.py`).
Each file is the **single home** for its domain's eval logic.
### D2. The "eval bench" pattern (extends ADR-0045 D5)
ADR-0045 D5 fixed a bench to a single configuration (single-SIP, or the
ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:
- An **eval bench** may drive *many* configurations and render figures. It
builds its own per-config `GraphEngine` / `RuntimeContext` instances
(one per sweep point) rather than using the outer `run_bench` engine.
- Because the outer ctx then has no submitted handles, the bench submits a
**sentinel tensor** (`torch.zeros((1, 1), …)`) at the end to satisfy
`run_bench`'s "must submit at least one request" contract (ADR-0045 D4),
so the CLI exits 0.
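
The shape of the pattern can be sketched with stand-ins. This is a toy model, not the kernbench API: `OuterContext`, `submit`, `simulate_point`, and `run_bench_stub` are hypothetical names, and a plain tuple stands in for the `torch.zeros((1, 1), …)` sentinel so the block runs without torch.

```python
from dataclasses import dataclass, field


@dataclass
class OuterContext:
    """Hypothetical stand-in for the outer run_bench context."""
    requests: list = field(default_factory=list)

    def submit(self, tensor):
        self.requests.append(tensor)


def simulate_point(config):
    """Hypothetical per-config run. A real eval bench would build its own
    engine/context here, one per sweep point."""
    m, n, k = config
    return {"shape": config, "flops": 2 * m * n * k}


def eval_bench(ctx, configs):
    # Drive many configurations without touching the outer context.
    results = [simulate_point(c) for c in configs]
    # The outer context has seen no requests so far, so submit a sentinel.
    ctx.submit(("sentinel", (1, 1)))
    return results


def run_bench_stub(bench, configs):
    """Toy model of the 'must submit at least one request' check."""
    ctx = OuterContext()
    results = bench(ctx, configs)
    if not ctx.requests:
        raise RuntimeError("NO_REQUESTS")
    return results


results = run_bench_stub(eval_bench, [(64, 64, 64), (128, 64, 32)])
```

Dropping the `ctx.submit(...)` line makes `run_bench_stub` raise, which is the failure the sentinel exists to avoid.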
### D3. Output location
Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
(per user request — artifacts beside the bench). The directory holds only
generated PNG/CSV/JSON (never a `.py`/`__init__.py`), so the eager-import
audit (ADR-0045 first action) ignores it — `pkgutil.iter_modules` does not
yield non-package subdirectories. It is **git-ignored** (regenerable on
demand), unlike the committed `docs/diagrams/` artifacts.
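
The `pkgutil.iter_modules` behaviour this relies on can be checked in isolation. The sketch below builds a throwaway directory tree (the layout mirrors the paths above, but the files are empty placeholders) and lists what the iterator yields; the audit itself is not reproduced here.

```python
import pkgutil
import tempfile
from pathlib import Path


def discovered_modules(package_dir):
    """Names that pkgutil.iter_modules reports for one directory."""
    return sorted(info.name for info in pkgutil.iter_modules([str(package_dir)]))


with tempfile.TemporaryDirectory() as tmp:
    benches = Path(tmp) / "benches"
    out = benches / "1H_milestone_output" / "gemm"
    out.mkdir(parents=True)
    (benches / "__init__.py").write_text("")
    (benches / "milestone_1h_gemm.py").write_text("")
    # Data-only output directory: no .py files and no __init__.py.
    (out / "figure.png").write_bytes(b"")
    names = discovered_modules(benches)

print(names)
```

Only the bench module is listed; the output directory is skipped because it contains no `__init__.py`.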
### D4. GEMM heavy sweep — fresh by default, `MILESTONE_FAST` to reuse
`milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one
shape is 2048 tiles). `MILESTONE_FAST=1` reuses the committed
`docs/diagrams/gemm_sweep.json` and only re-renders (seconds). This
reverses ADR-0044 D1/D2's "heavy sweep stays a manual/`slow`-marked step":
running the bench *is* the regeneration. In the test suite, the full sweep
is exercised by a `@pytest.mark.slow` bench test, while the
`MILESTONE_FAST` path is tested by default.
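
One way to express the fresh-by-default switch is sketched below. `load_or_run_sweep` and `fake_sweep` are hypothetical names, and the behaviour when the cached JSON is missing is not specified in this ADR, so the sketch simply lets the read fail.

```python
import json
import os
import tempfile
from pathlib import Path


def load_or_run_sweep(cache_path, run_sweep, env=None):
    """Return (data, source). Reuse the cached JSON only if MILESTONE_FAST=1."""
    env = os.environ if env is None else env
    if env.get("MILESTONE_FAST") == "1":
        # Fast path: re-render from existing data, no simulation.
        return json.loads(Path(cache_path).read_text()), "reused"
    # Default path: run the full sweep.
    return run_sweep(), "fresh"


def fake_sweep():
    """Hypothetical stand-in for the multi-minute sweep."""
    return {"points": 24, "origin": "simulated"}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "gemm_sweep.json"
    cache.write_text(json.dumps({"points": 24, "origin": "committed"}))
    fast = load_or_run_sweep(cache, fake_sweep, env={"MILESTONE_FAST": "1"})
    fresh = load_or_run_sweep(cache, fake_sweep, env={})
```

Passing `env` explicitly keeps the function testable without mutating the process environment.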
### D5. Tests + script reuse via thin re-export shims (single home kept)
The pre-existing figure tests and the `scripts/gemm_sweep.py` entry point
are retained and now reuse the bench modules:
- `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
`GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
`kernbench.benches.milestone_1h_gemm`.
- `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
writers, sweep constants, renderers, and disk aggregators from
`kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
`_bk_params`) and the fixture-coupled `_run_distributed`
(`monkeypatch.chdir` + `_drive_distributed`) wrapper.
- `scripts/gemm_sweep.py` → thin wrapper over the bench's `run_sweep`.
Tests may import a bench module (tests sit above production, ADR-0007).
Doing so triggers the whole-package eager audit, which already runs on
every `kernbench` invocation. Because matplotlib is still imported lazily
inside the renderers, the audit's startup cost is unchanged.
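
The shim arrangement can be illustrated with two throwaway modules. The module names `adr0054_bench_stub` and `adr0054_shim_stub` and the function `render_latency` are hypothetical; only `GEMM_SWEEP_JSON` is a name taken from the list above. The point is that the shim rebinds the bench's objects rather than copying logic.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the bench module (the single home).
BENCH_SRC = '''
GEMM_SWEEP_JSON = "gemm_sweep.json"

def render_latency(rows):
    return f"latency figure from {len(rows)} rows"
'''

# Hypothetical shim: re-exports only, defines no logic of its own.
SHIM_SRC = '''
from adr0054_bench_stub import GEMM_SWEEP_JSON, render_latency

__all__ = ["GEMM_SWEEP_JSON", "render_latency"]
'''

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "adr0054_bench_stub.py").write_text(BENCH_SRC)
    Path(tmp, "adr0054_shim_stub.py").write_text(SHIM_SRC)
    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        bench = importlib.import_module("adr0054_bench_stub")
        shim = importlib.import_module("adr0054_shim_stub")
    finally:
        sys.path.remove(tmp)

# The shim exposes the very same function object as the bench.
same_object = shim.render_latency is bench.render_latency
figure = shim.render_latency([1, 2, 3])
```

Because the names resolve to the same objects, existing test imports keep working while the bench remains the only place the logic is defined.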
### D6. Flat module naming (no `benches/` subfolder)
A `benches/` subpackage named `1H_milestone…` cannot be imported with a
normal `import` statement, because a Python identifier cannot start with a
digit. The benches are therefore flat
modules `milestone_1h_gemm.py` / `milestone_1h_ccl.py` with bench names
`milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, letter-first per
ADR-0045 D1).
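
The naming constraint is easy to verify with `str.isidentifier`. In the sketch below, `module_name_for` is a hypothetical helper that assumes the bench name maps to its module name by swapping hyphens for underscores; this ADR does not state how kernbench performs that lookup.

```python
def module_name_for(bench_name):
    """Assumed mapping: kebab-case bench name -> snake_case module name."""
    return bench_name.replace("-", "_")


candidates = ["1H_milestone_output", "milestone_1h_gemm", "milestone_1h_ccl"]
# A name is usable in an import statement only if it is a valid identifier.
importable = {name: name.isidentifier() for name in candidates}

gemm_module = module_name_for("milestone-1h-gemm")
ccl_module = module_name_for("milestone-1h-ccl")
```

The digit-first directory name fails the identifier check, while both flat module names pass it.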
## Consequences
### Positive
- `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of
a domain's results + figures in one command — the milestone requirement.
- Single source for the eval logic (the bench), reused by tests and the
script via shims; no duplication.
- The figure tests and `scripts/gemm_sweep.py` keep working unchanged.
### Negative / limitations
- The two bench files are large (the CCL one mixes the distributed driver,
sweeps, and matplotlib drawing). A "bench" that is mostly an eval harness
is unusual; this ADR legitimizes it.
- Generated artifacts live inside the source tree (`src/kernbench/benches/`)
by explicit request; git-ignored to avoid committing them.
- `milestone-1h-ccl` (and the default `milestone-1h-gemm`) take minutes —
acceptable for an on-demand milestone artifact, not for routine runs.
## Dependencies
- **ADR-0007**: layer direction (why tests may import production but a bench
may not import tests).
- **ADR-0043 / ADR-0044**: the allreduce / GEMM eval harnesses this ADR
relocates into benches.
- **ADR-0045**: bench module contract; D2 here extends its D5 (single-device
rule) with the eval-bench pattern, and relies on D4 (NO_REQUESTS) for the
sentinel.
- **ADR-0024**: rank = SIP launcher driven by the allreduce sweeps.
## Open questions
- Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from
ADR-0033/0014 rather than copied? Unchanged by this ADR.
- Should `build_overview_slides.py` consume the milestone output PNGs
instead of drawing GEMM bars natively? Still open (ADR-0044 D6 / Negative).