eval: fold GEMM/allreduce harnesses into self-contained milestone benches

Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command per domain:

    kernbench run --bench milestone-1h-gemm   (MILESTONE_FAST=1 reuses JSON)
    kernbench run --bench milestone-1h-ccl

- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
  run(torch) entry drives the sweeps and writes figures into
  benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
  sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
  re-export/wrapper shims over the benches (single source preserved); the
  pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
  per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).

ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated. Verified: milestone benches run
clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,141 @@
# ADR-0054: Milestone Eval Benches — self-contained sweep + figure benches

## Status

Accepted (2026-05-22).

Amends ADR-0044 (D1/D2) and ADR-0045 (D5) and supersedes the "logic lives
in `scripts/` + `tests/`" arrangement of ADR-0043/0044: the GEMM and
allreduce evaluation harnesses are now self-contained **benches** that a
user runs to regenerate every result + figure.

## Context

ADR-0043 (allreduce eval) and ADR-0044 (GEMM eval) split each harness into
a **sweep** (a manual `scripts/` driver or, for allreduce, the
parametrized tests themselves) plus **figure tests** that render committed
data. The sweep/render logic therefore lived in `scripts/gemm_sweep.py`,
`tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.

A milestone requirement ("refactor allreduce + GEMM evaluation so a user
can run *one bench* to generate all the results and plots") cannot be met
by that layout: a bench is production code and **must not import from
`tests/`** (ADR-0007 layer direction). The eval logic had to move into
production code, reachable from a bench.

The chosen home is the bench module itself, not a separate
`kernbench.eval` package. A bench file may contain arbitrary module-level
code; collapsing the harness into the bench keeps one file per domain and
avoids an extra package layer.

## Decision

### D1. Two milestone benches own the eval logic

- `src/kernbench/benches/milestone_1h_gemm.py`: the GEMM shape×variant sweep
  and the three figure renderers (moved from `scripts/gemm_sweep.py` +
  `tests/gemm/_gemm_plot_helpers.py`).
- `src/kernbench/benches/milestone_1h_ccl.py`: the distributed allreduce
  driver, latency + buffer-kind sweeps, topology diagram, FSIM comparison,
  and the direct-launch parity reference (moved from
  `tests/sccl/_allreduce_helpers.py`).

Each file is the **single home** for its domain's eval logic.

### D2. The "eval bench" pattern (extends ADR-0045 D5)

ADR-0045 D5 fixed a bench to a single configuration (single-SIP, or the
ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:

- An **eval bench** may drive *many* configurations and render figures. It
  builds its own per-config `GraphEngine` / `RuntimeContext` instances
  (one per sweep point) rather than using the outer `run_bench` engine.
- Because the outer context then has no submitted handles, the bench
  submits a **sentinel tensor** (`torch.zeros((1, 1), …)`) at the end to
  satisfy `run_bench`'s "must submit at least one request" contract
  (ADR-0045 D4), so the CLI exits 0.
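
The control flow of an eval bench can be sketched as follows. This is an illustration only, not kernbench's API: `StubTorch`, `simulate_point`, and the extra `configs` parameter are hypothetical stand-ins, and the sketch assumes that creating a tensor on the outer handle is what registers a request. Only the `run(torch)` entry point and the `(1, 1)` zeros sentinel come from this ADR.

```python
from dataclasses import dataclass, field


@dataclass
class StubTorch:
    """Hypothetical stand-in for the handle that run_bench passes to run()."""

    submitted: list = field(default_factory=list)

    def zeros(self, shape):
        # Assumption: creating a tensor on the outer handle counts as a request.
        tensor = {"shape": shape, "fill": 0.0}
        self.submitted.append(tensor)
        return tensor


def simulate_point(config):
    """Hypothetical per-config run; a real bench would build its own engine here."""
    m, n, k = config
    return {"config": config, "flops": 2 * m * n * k}


def run(torch, configs):
    # Drive every sweep point on its own private, per-config "engine".
    results = [simulate_point(c) for c in configs]
    # The outer handle has seen no requests so far, so submit the sentinel last.
    torch.zeros((1, 1))
    return results


stub = StubTorch()
results = run(stub, [(64, 64, 64), (128, 128, 128)])
print(len(results), len(stub.submitted))  # 2 1
```

The sentinel carries no data of interest; it exists only so the outer run has at least one submitted request.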

### D3. Output location

Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
(per user request: artifacts beside the bench). The directory holds only
generated PNG/CSV/JSON (never a `.py`/`__init__.py`), so the eager-import
audit (ADR-0045 first action) ignores it: `pkgutil.iter_modules` does not
yield subdirectories that lack an `__init__.py`. It is **git-ignored**
(regenerable on demand), unlike the committed `docs/diagrams/` artifacts.
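
The `pkgutil.iter_modules` behaviour relied on here can be checked in isolation. The sketch below builds a throwaway directory that mimics the layout (one module file plus a data-only output directory); the temporary paths and the `discovered_names` helper are illustrative and not part of kernbench.

```python
import pkgutil
import tempfile
from pathlib import Path


def discovered_names(package_dir):
    """Return the names pkgutil.iter_modules reports for one directory."""
    return sorted(info.name for info in pkgutil.iter_modules([str(package_dir)]))


with tempfile.TemporaryDirectory() as tmp:
    benches = Path(tmp) / "benches"
    benches.mkdir()
    # One real bench module.
    (benches / "milestone_1h_gemm.py").write_text("X = 1\n")
    # An artifact directory: data files only, no __init__.py anywhere.
    out = benches / "1H_milestone_output" / "gemm"
    out.mkdir(parents=True)
    (out / "gemm_sweep.json").write_text("{}\n")

    names = discovered_names(benches)
    print(names)  # ['milestone_1h_gemm']
```

Dropping an `__init__.py` into the output directory would make it visible to the scan, which is why the directory must stay data-only.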

### D4. GEMM heavy sweep: fresh by default, `MILESTONE_FAST` to reuse

`milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one
shape is 2048 tiles). `MILESTONE_FAST=1` reuses the committed
`docs/diagrams/gemm_sweep.json` and only re-renders (seconds). This
reverses ADR-0044 D1/D2's "heavy sweep stays a manual/`slow`-marked step":
running the bench *is* the regeneration. In the test suite, the slow path
is exercised by a `@pytest.mark.slow` bench test, while the fast-path test
runs by default.
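
A minimal sketch of the fresh-versus-reuse switch, assuming the bench compares the variable against the exact string `"1"`. The `load_or_run_sweep` helper, its return shape, and `fake_sweep` are hypothetical; only the `MILESTONE_FAST` variable and the idea of reusing a cached sweep JSON come from this ADR.

```python
import json
import os
import tempfile
from pathlib import Path


def load_or_run_sweep(json_path, run_sweep, environ=os.environ):
    """Reuse the cached sweep when MILESTONE_FAST=1, otherwise sweep afresh."""
    if environ.get("MILESTONE_FAST") == "1":
        return json.loads(Path(json_path).read_text()), "reused"
    return run_sweep(), "fresh"


def fake_sweep():
    # Hypothetical stand-in for the 24-simulation sweep.
    return {"points": 24}


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "gemm_sweep.json"
    cached.write_text(json.dumps({"points": 24, "source": "cache"}))

    fresh = load_or_run_sweep(cached, fake_sweep, environ={})
    fast = load_or_run_sweep(cached, fake_sweep, environ={"MILESTONE_FAST": "1"})

print(fresh[1], fast[1])  # fresh reused
```

Passing `environ` explicitly keeps the switch testable without mutating the process environment.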

### D5. Tests + script reuse via thin re-export shims (single home kept)

The pre-existing figure tests and the `scripts/gemm_sweep.py` entry point
are retained and now reuse the bench modules:

- `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
  `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
  `kernbench.benches.milestone_1h_gemm`.
- `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
  writers, sweep constants, renderers, and disk aggregators from
  `kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
  local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
  `_bk_params`) and the fixture-coupled `_run_distributed`
  (`monkeypatch.chdir` + `_drive_distributed`) wrapper.
- `scripts/gemm_sweep.py` → thin wrapper over the bench's `run_sweep`.

A test may import a bench module (tests sit above production, ADR-0007).
Doing so triggers the whole-package eager audit, which already runs on
every `kernbench` invocation. matplotlib stays lazily imported inside the
renderers, so the audit's startup cost is unchanged.
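
The "single home" property of a re-export shim can be demonstrated without kernbench installed. In this sketch, `fake_bench`, `fake_plot_helpers`, and `render_bars` are hypothetical in-memory stand-ins for the bench module, the test helper, and one renderer; the point is only that a re-export binds the same object rather than copying it.

```python
import sys
import types

# Hypothetical stand-in for the bench module (the single home of the logic).
bench = types.ModuleType("fake_bench")
exec(
    "ROOT = 'repo-root'\n"
    "def render_bars(data):\n"
    "    return sorted(data)\n",
    bench.__dict__,
)
sys.modules["fake_bench"] = bench

# The shim body is nothing but re-exports, so no second copy of the logic exists.
shim = types.ModuleType("fake_plot_helpers")
exec("from fake_bench import ROOT, render_bars\n", shim.__dict__)

print(shim.render_bars is bench.render_bars)  # True
```

Because both names refer to one function object, a fix made in the bench is picked up by every caller that goes through the shim.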

### D6. Flat module naming (no `benches/` subfolder)

A `benches/` subpackage named `1H_milestone…` is impossible: a package name
used in an `import` statement must be a valid identifier, which cannot
start with a digit. The benches are therefore flat modules
`milestone_1h_gemm.py` / `milestone_1h_ccl.py` with bench names
`milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, letter-first per
ADR-0045 D1).
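
The naming constraint is easy to verify with the standard library. The two helpers below are illustrative, and the underscore-to-hyphen mapping in `bench_name` is an assumption inferred from the names listed in this section, not a documented kernbench rule.

```python
import keyword


def is_importable_name(name):
    """True when `name` can appear as a module name in an import statement."""
    return name.isidentifier() and not keyword.iskeyword(name)


def bench_name(module_name):
    # Assumption: the bench name is the module name with "_" swapped for "-".
    return module_name.replace("_", "-")


print(is_importable_name("1H_milestone"))       # False
print(is_importable_name("milestone_1h_gemm"))  # True
print(bench_name("milestone_1h_ccl"))           # milestone-1h-ccl
```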

## Consequences

### Positive

- `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of
  a domain's results + figures in one command, which is the milestone
  requirement.
- Single source for the eval logic (the bench), reused by tests and the
  script via shims; no duplication.
- The figure tests and `scripts/gemm_sweep.py` keep working unchanged.

### Negative / limitations

- The two bench files are large (the CCL one mixes the distributed driver,
  sweeps, and matplotlib drawing). A "bench" that is mostly an eval harness
  is unusual; this ADR legitimizes it.
- Generated artifacts live inside the source tree (`src/kernbench/benches/`)
  by explicit request; they are git-ignored to avoid committing them.
- `milestone-1h-ccl` (and the default `milestone-1h-gemm`) take minutes,
  which is acceptable for an on-demand milestone artifact but not for
  routine runs.

## Dependencies

- **ADR-0007**: layer direction (why tests may import production but a bench
  may not import tests).
- **ADR-0043 / ADR-0044**: the allreduce / GEMM eval harnesses this ADR
  relocates into benches.
- **ADR-0045**: bench module contract; D2 here extends its D5 (single-device
  rule) with the eval-bench pattern, and relies on D4 (NO_REQUESTS) for the
  sentinel.
- **ADR-0024**: rank = SIP launcher driven by the allreduce sweeps.

## Open questions

- Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from
  ADR-0033/0014 rather than copied? Unchanged by this ADR.
- Should `build_overview_slides.py` consume the milestone output PNGs
  instead of drawing GEMM bars natively? Still open (ADR-0044 D6 / Negative).