cc1bbd0ab7
Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command:
kernbench run --bench milestone-1h-gemm (MILESTONE_FAST=1 reuses JSON)
kernbench run --bench milestone-1h-ccl
- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
run(torch) entry drives the sweeps and writes figures into
benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
re-export/wrapper shims over the benches (single source preserved); the
pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).
ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated. Verified: milestone benches run
clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
137 lines
6.0 KiB
Markdown
# ADR-0044: GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`
## Status

Accepted

Documents the GEMM evaluation/characterization harness; verified against the
implementation (constants, tile sizes, figure set, and the script↔test
split cross-checked). The D5/D6 caveats are recorded limitations, not
inaccuracies.

**Amended by ADR-0054**: the sweep + renderers moved into the
`milestone-1h-gemm` bench (single home); `scripts/gemm_sweep.py` and
`tests/gemm/` now re-export from it. D1/D2's "data generation stays a manual
script / heavy work is opt-in" is superseded by the eval-bench pattern (one
bench regenerates everything; `MILESTONE_FAST=1` reuses the committed JSON).

## Context

ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generators) define the GEMM
*implementation*; ADR-0033 defines the latency model. None of them describe
**how GEMM performance is swept and characterized**: the shape/variant
sweep that produces the timing data, and the figures that interpret it.
This ADR pins that harness.

Unlike the allreduce harness (ADR-0043), the GEMM sweep is **heavy** (24
sim runs: 8 shapes × 3 operand-staging variants; the `512³` shape alone is
2048 tiles). That weight drives the split below.

## Decision
### D1. Two-layer split — heavy data generation (script) vs. fast figures (tests)

- **Data generation stays a manual script**: `scripts/gemm_sweep.py` runs
  `matmul-composite` (ADR-0042 plans) across shapes × variants via the same
  `run_bench` path the CLI uses, harvests `result.engine.op_log`, and
  writes `docs/diagrams/gemm_sweep.json` (per-stage / per-engine wall-clock
  + occupancy + record counts + pe/composite windows).
- **Figure rendering is test-generated**: `tests/gemm/` reads the committed
  `gemm_sweep.json` and renders matplotlib PNGs into
  `docs/diagrams/gemm_plots/`. These tests are fast and run by default.

Rationale: a slide-deck-scale sim sweep does not belong in every `pytest`
run, but the figures (cheap, deterministic) should regenerate freely and be
guarded by CI. This mirrors CLAUDE.md's script-vs-test split (scripts for
heavy/manual generation; tests for fast assertions).

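The data-generation layer in D1 can be pictured with a minimal sketch, under stated assumptions. The shape list, the two unnamed variant labels, `run_one`, and the JSON layout below are hypothetical placeholders: the ADR names only the `load_ref` variant, the `tile_sizes` key, and the 8 × 3 grid. The real script drives `run_bench` and reads `result.engine.op_log` where the stand-in returns an empty record.

```python
import itertools
import json
from pathlib import Path

# Placeholder grid: the ADR states 8 shapes x 3 variants but does not list them.
SHAPES = [(64, 64, 64), (128, 128, 128)]            # hypothetical (M, K, N) shapes
VARIANTS = ["load_ref", "variant_b", "variant_c"]   # only load_ref is named in the ADR


def run_one(shape, variant):
    """Stand-in for one simulator run (the real harness goes through run_bench)."""
    return {"shape": list(shape), "variant": variant, "stages": {}}


def sweep(shapes, variants, run=run_one):
    """Run every shape x variant combination and collect one record per run."""
    return [run(shape, variant) for shape, variant in itertools.product(shapes, variants)]


def write_sweep(path, records, tile_sizes):
    """Write the records plus the tile sizes they were measured with to one JSON file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"tile_sizes": tile_sizes, "runs": records}  # "runs" is an assumed key
    path.write_text(json.dumps(payload, indent=2))
```

Keeping the sweep and the write step as separate functions is one way to let a fast consumer (the figure tests) depend only on the committed JSON, never on the simulator.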
### D2. Slow regenerator test wraps the script

`tests/gemm/test_gemm_sweep.py` is marked `@pytest.mark.slow` (excluded by
the default `addopts: -m "not slow"`). It invokes `scripts/gemm_sweep.py`
via subprocess to regenerate `gemm_sweep.json` on demand
(`pytest -m slow tests/gemm/test_gemm_sweep.py`). The sweep logic has a
single home (the script); the test only wraps it, so there is no duplicated
sim-driving code.

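The wrap-only relationship in D2 might look roughly like the sketch below. `regenerate_sweep` and its timeout are hypothetical names and values chosen for illustration, not the contents of `test_gemm_sweep.py`; the point is that the helper only launches the script and contains no sim-driving code of its own.

```python
import subprocess
import sys
from pathlib import Path


def regenerate_sweep(script: Path, timeout_s: float = 3600.0) -> None:
    """Run the sweep script in a child interpreter; raise if it exits non-zero."""
    subprocess.run([sys.executable, str(script)], check=True, timeout=timeout_s)


# In the test module this would be wrapped roughly as follows (assumed layout):
#
#   @pytest.mark.slow
#   def test_gemm_sweep():
#       regenerate_sweep(Path("scripts/gemm_sweep.py"))
#       assert Path("docs/diagrams/gemm_sweep.json").exists()
```

Using `check=True` turns a failed sweep into a test failure without the test needing to know anything about the sweep's internals.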
### D3. Figure set (3 charts, `load_ref` variant)

| Test | PNG | Content |
|---|---|---|
| `test_plot_gemm_stage_breakdown.py` | `gemm_stage_breakdown.png` | per-stage engine wall-clock (DMA in / Fetch / GEMM / DMA out) |
| `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_measured.png` | GEMM util % + useful eff % |
| `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_theoretical_vs_measured.png` | theoretical vs simulator-measured util/eff |

`tests/gemm/_gemm_plot_helpers.py` holds the shared renderers (series logic
mirrors the GEMM `_render_*` functions in `scripts/build_overview_slides.py`,
which still draws these natively in the PPTX). The helper module is not
collected by pytest (no `test_` prefix). Each `test_plot_*` skips if
`gemm_sweep.json` is absent.

### D4. Tile sizes are data-driven; under-tile shapes are flagged

Tile sizes are read from `gemm_sweep.json` (`tile_sizes`), which the sweep
records from `PeSchedulerComponent.TILE_M/K/N = 32/64/32`, the authoritative
source. Shapes with `M<TILE_M ∨ K<TILE_K ∨ N<TILE_N` are flagged
("under-tile") on the charts. The `512³` shape is excluded from the figures
(`EXCLUDED_SHAPES`).

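The under-tile predicate from D4 and the tile count quoted in the Context section can be checked with a short sketch. The tile sizes are the values stated above; the function names are hypothetical, and the ceil-division in `tile_count` is an assumption about how partial tiles are counted, which this ADR does not specify.

```python
import math

# Tile sizes as stated in D4 (PeSchedulerComponent.TILE_M/K/N = 32/64/32).
TILE_M, TILE_K, TILE_N = 32, 64, 32


def is_under_tile(m, k, n, tile=(TILE_M, TILE_K, TILE_N)):
    """True when any dimension is smaller than its tile size (the D4 flag)."""
    tile_m, tile_k, tile_n = tile
    return m < tile_m or k < tile_k or n < tile_n


def tile_count(m, k, n, tile=(TILE_M, TILE_K, TILE_N)):
    """Number of (M, K, N) tiles, assuming partial tiles round up to a full tile."""
    tile_m, tile_k, tile_n = tile
    return math.ceil(m / tile_m) * math.ceil(k / tile_k) * math.ceil(n / tile_n)


# 512 x 512 x 512: (512/32) * (512/64) * (512/32) = 16 * 8 * 16 = 2048 tiles,
# which matches the figure given in the Context section.
print(tile_count(512, 512, 512))
```

Passing the tile sizes as a parameter mirrors the data-driven intent: a caller can feed in whatever `tile_sizes` the JSON recorded instead of relying on the defaults.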
### D5. Theoretical model — inherited constants, NOT yet ADR-verified

The "theoretical" curves use an analytical ideal-pipeline model with
constants copied verbatim from `scripts/build_overview_slides.py`:

```
HBM_GBS  = 256.0  # GB/s
T_STAGE  = 16.0   # ns
D_STAGES = 3
BPE      = 2
```

**These are not yet sourced against the ADRs.** Notably ADR-0033's `256`
is `burst_bytes` (256 B), a *different* quantity from this `256 GB/s`, and
ADR-0033 derives bandwidth as `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`.
`T_STAGE`/stage-count are not traced to ADR-0014 here. The model is
therefore **consistent with the existing deck script, not verified against
the ADRs**, and the constants are duplicated (deck + helper). Reconciling
them (source from topology/ADR-0033/0014, de-duplicate) is deferred; see
Open questions.

### D6. Known naming caveat — `_measured` chart

`gemm_mac_utilization_measured.png` currently plots the *theoretical*
ideal-pipeline numbers (its footnote says so); only the filename says
"measured". This is a known misnomer pending a decision to either repoint
its content to the simulator-measured series or retitle it.

## Consequences
### Positive

- GEMM figures are test-generated and CI-guarded, like allreduce.
- The heavy sweep stays opt-in, keeping the default test run fast.
- Single source for the sweep logic (the script), reused by the slow test.

### Negative / limitations

- The theoretical-model constants (D5) are unverified and duplicated.
- The `_measured` figure is a misnomer (D6).
- `build_overview_slides.py` still renders the GEMM bars natively from
  `gemm_sweep.json` rather than embedding these PNGs; the deck rewiring to
  consume the test artifacts is not done.

## Dependencies

- **ADR-0013**: verification strategy.
- **ADR-0014 / ADR-0042**: PE pipeline + tile-plan generators, i.e. the GEMM
  implementation the sweep measures; D4's stage record counts come from
  ADR-0042 D2/D3.
- **ADR-0033**: latency model, the source the D5 constants should (but do
  not yet) trace to.
- **ADR-0043**: the sibling allreduce evaluation harness.

## Open questions

- Reconcile D5 constants against `topology.yaml` / ADR-0033 / ADR-0014 and
  de-duplicate (one source for the model parameters)?
- Resolve the D6 `_measured` naming (repoint content vs. retitle)?
- Rewire `build_overview_slides.py` to embed the `gemm_plots/` PNGs instead
  of native bar-drawing?