kernbench2/docs/adr/ADR-0044-eval-gemm-harness.md
mukesh fd56b6cacd adr: add ADR-0043/0044 (eval harnesses); reconcile ADR-0024/0032 for SIP w/h
Document the allreduce + GEMM evaluation harnesses and bring the affected
allreduce ADRs in line with the refactored code.

New (Accepted, EN + KO):
- ADR-0043 — allreduce evaluation harness (tests/sccl/): distributed-driven
  correctness, latency/buffer-kind sweeps, sessionfinish plot aggregators,
  topology + FSIM-comparison figures. Verified against the implementation.
- ADR-0044 — GEMM evaluation harness (scripts/gemm_sweep.py + tests/gemm/):
  heavy-script data gen vs. fast test-rendered figures, slow regenerator,
  the 3-figure set. Records two limitations as open questions: the
  theoretical-model constants are inherited (not yet traced to ADR-0033/
  0014), and the *_measured figure is a naming misnomer.

Updated (EN + KO):
- ADR-0024 — add D5: SIP grid w/h resolution (explicit sips.w/h, square
  fallback, fail-loud), documenting the AhbmCCLBackend fix.
- ADR-0032 — D4/D5/Non-goals reconciled: rectangular SIP grids (e.g. 6 SIPs
  as 3x2) are supported via explicit w/h; the square requirement now
  applies only to the fallback. Affected-files repointed to tests/sccl/.

Verification: ADR-0023 and ADR-0042 confirmed still matching the code (no
change). verify_adr_lang_pairs.py passes (EN/KO Status blocks byte-equal).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 10:26:25 -07:00

# ADR-0044: GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`
## Status
Accepted

Documents the GEMM evaluation/characterization harness; verified against the
implementation (constants, tile sizes, figure set, and the script↔test
split cross-checked). The D5/D6 caveats are recorded limitations, not
inaccuracies.
## Context
ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generators) define the GEMM
*implementation*; ADR-0033 defines the latency model. None of them describe
**how GEMM performance is swept and characterized** — the shape/variant
sweep that produces the timing data, and the figures that interpret it.
This ADR pins that harness.
Unlike the allreduce harness (ADR-0043), the GEMM sweep is **heavy**: 24
sim runs (8 shapes × 3 operand-staging variants), and the `512³` shape alone
accounts for 2048 tiles. That weight drives the split below.
## Decision
### D1. Two-layer split — heavy data generation (script) vs. fast figures (tests)
- **Data generation stays a manual script**: `scripts/gemm_sweep.py` runs
`matmul-composite` (ADR-0042 plans) across shapes × variants via the same
`run_bench` path the CLI uses, harvests `result.engine.op_log`, and
writes `docs/diagrams/gemm_sweep.json` (per-stage / per-engine wall-clock
+ occupancy + record counts + pe/composite windows).
- **Figure rendering is test-generated**: `tests/gemm/` reads the committed
`gemm_sweep.json` and renders matplotlib PNGs into
`docs/diagrams/gemm_plots/`. These tests are fast and run by default.

Rationale: a slide-deck-scale sim sweep does not belong in every `pytest`
run, but the figures (cheap, deterministic) should regenerate freely and be
guarded by CI. This mirrors CLAUDE.md's script-vs-test split (scripts for
heavy/manual generation; tests for fast assertions).
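
The data-generation layer can be pictured as a loop over the sweep matrix that writes one JSON file. This is only a sketch under stated assumptions: `run_one` is a hypothetical stand-in for the real `run_bench` call, and the `{"runs": [...]}` layout is illustrative, not the actual schema of `gemm_sweep.json`.

```python
import itertools
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple

Shape = Tuple[int, int, int]


def run_sweep(
    shapes: Iterable[Shape],
    variants: Iterable[str],
    run_one: Callable[[Shape, str], Dict[str, float]],
    out_path: Path,
) -> int:
    """Run every (shape, variant) pair once and write the results as JSON.

    `run_one` stands in for the simulator call; the output layout is
    hypothetical and only illustrates the script-writes-JSON layer.
    """
    records = []
    for shape, variant in itertools.product(shapes, variants):
        metrics = run_one(shape, variant)
        records.append({"shape": list(shape), "variant": variant, **metrics})
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps({"runs": records}, indent=2))
    return len(records)
```

Because the figure tests only read the committed JSON, nothing in the default `pytest` run needs to call a function like this.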
### D2. Slow regenerator test wraps the script
`tests/gemm/test_gemm_sweep.py` is marked `@pytest.mark.slow` (excluded by
the default `addopts: -m "not slow"`). It invokes `scripts/gemm_sweep.py`
via subprocess to regenerate `gemm_sweep.json` on demand
(`pytest -m slow tests/gemm/test_gemm_sweep.py`). The sweep logic has a
single home (the script); the test only wraps it, so there is no duplicated
sim-driving code.
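
A minimal sketch of such a wrapper, assuming the script needs no arguments and signals failure through a non-zero exit code (the helper name `regenerate_sweep` is hypothetical):

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional


def regenerate_sweep(script: Path, timeout_s: Optional[float] = None) -> None:
    """Run the sweep script in a child interpreter.

    Raises subprocess.CalledProcessError if the script exits non-zero.
    """
    subprocess.run([sys.executable, str(script)], check=True, timeout=timeout_s)


# Inside tests/gemm/test_gemm_sweep.py the call would sit under the slow
# marker, roughly like this (kept as a comment so the sketch is stdlib-only):
#
#   @pytest.mark.slow
#   def test_regenerate_gemm_sweep():
#       regenerate_sweep(Path("scripts/gemm_sweep.py"))
#       assert Path("docs/diagrams/gemm_sweep.json").exists()
```

Launching a child process keeps the test a thin wrapper: it checks only that the script ran and produced its output, without importing any sweep logic.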
### D3. Figure set (3 charts, `load_ref` variant)
| Test | PNG | Content |
|---|---|---|
| `test_plot_gemm_stage_breakdown.py` | `gemm_stage_breakdown.png` | per-stage engine wall-clock (DMA in / Fetch / GEMM / DMA out) |
| `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_measured.png` | GEMM util % + useful eff % |
| `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_theoretical_vs_measured.png` | theoretical vs simulator-measured util/eff |

`tests/gemm/_gemm_plot_helpers.py` holds the shared renderers; its series logic
mirrors the GEMM `_render_*` functions in `scripts/build_overview_slides.py`,
which still draws these charts natively in the PPTX. The helper module is not
collected by pytest (no `test_` prefix). Each `test_plot_*` test skips if
`gemm_sweep.json` is absent.
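
The skip-if-absent behaviour can be sketched as a loader that reports a missing file instead of raising. The function name `load_sweep` is an assumption for illustration; the real helper may be structured differently.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def load_sweep(path: Path) -> Optional[Dict[str, Any]]:
    """Return the parsed sweep data, or None if the JSON file is not present."""
    if not path.is_file():
        return None
    return json.loads(path.read_text())


# In a test_plot_* test, the None case would become a skip, for example:
#
#   data = load_sweep(Path("docs/diagrams/gemm_sweep.json"))
#   if data is None:
#       pytest.skip("gemm_sweep.json is absent; run the slow regenerator first")
```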
### D4. Tile sizes are data-driven; under-tile shapes are flagged
Tile sizes are read from `gemm_sweep.json` (`tile_sizes`), which the sweep
records from `PeSchedulerComponent.TILE_M/K/N = 32/64/32` — the authoritative
source. Shapes with `M<TILE_M`, `K<TILE_K`, or `N<TILE_N` are flagged
("under-tile") on the charts. The `512³` shape is excluded from the figures
(`EXCLUDED_SHAPES`).
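
As an illustration of the flagging rule, the sketch below assumes a shape is "under-tile" when any one dimension is smaller than its tile size, and that `tile_sizes` is a mapping keyed by `M`/`K`/`N`. Both the key layout and the helper names are hypothetical; only the 32/64/32 values and the `EXCLUDED_SHAPES` name come from this ADR.

```python
from typing import Dict, Iterable, List, Tuple

Shape = Tuple[int, int, int]

# Hypothetical in-memory stand-in for the `tile_sizes` entry of gemm_sweep.json.
tile_sizes: Dict[str, int] = {"M": 32, "K": 64, "N": 32}

# The 512^3 shape is left out of the figures.
EXCLUDED_SHAPES = {(512, 512, 512)}


def is_under_tile(shape: Shape, tiles: Dict[str, int]) -> bool:
    """True if any dimension is smaller than the corresponding tile size."""
    m, k, n = shape
    return m < tiles["M"] or k < tiles["K"] or n < tiles["N"]


def plotted_shapes(shapes: Iterable[Shape], tiles: Dict[str, int]) -> List[Tuple[Shape, bool]]:
    """Drop excluded shapes and pair each remaining shape with its under-tile flag."""
    return [(s, is_under_tile(s, tiles)) for s in shapes if s not in EXCLUDED_SHAPES]


print(plotted_shapes([(16, 64, 32), (32, 64, 32), (512, 512, 512)], tile_sizes))
# [((16, 64, 32), True), ((32, 64, 32), False)]
```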
### D5. Theoretical model — inherited constants, NOT yet ADR-verified
The "theoretical" curves use an analytical ideal-pipeline model with
constants copied verbatim from `scripts/build_overview_slides.py`:
```
HBM_GBS  = 256.0  # GB/s
T_STAGE  = 16.0   # ns
D_STAGES = 3
BPE      = 2
```
**These are not yet sourced against the ADRs.** Notably, ADR-0033's `256`
is `burst_bytes` (256 B), a *different* quantity from this `256 GB/s`, and
ADR-0033 derives bandwidth as `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`.
Neither `T_STAGE` nor the stage count is traced to ADR-0014 here. The model is
therefore **consistent with the existing deck script, not verified against
the ADRs**, and the constants are duplicated (deck + helper). Reconciling
them (source from topology/ADR-0033/0014, de-duplicate) is deferred; see
Open questions.
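
One possible shape for the deferred de-duplication is a single immutable parameter object that both the deck script and the plot helper import. This is a sketch only: the class name `GemmModelParams` is hypothetical, the defaults are the unverified values listed above, and the `num_pcs` value in the usage line is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GemmModelParams:
    """Single home for the ideal-pipeline model constants (values unverified)."""
    hbm_gbs: float = 256.0     # GB/s; not the same quantity as ADR-0033's 256 B burst size
    t_stage_ns: float = 16.0   # ns per pipeline stage
    d_stages: int = 3          # number of pipeline stages
    bpe: int = 2               # BPE constant as copied from the deck script


def pc_bw_gbs(hbm_to_router_bw_gbs: float, num_pcs: int) -> float:
    """Per-PC bandwidth as ADR-0033 derives it: total bandwidth split across PCs."""
    return hbm_to_router_bw_gbs / num_pcs


params = GemmModelParams()
print(params.hbm_gbs)        # 256.0
print(pc_bw_gbs(256.0, 8))   # 32.0 (num_pcs=8 is an illustrative value)
```

Freezing the dataclass makes accidental divergence between the two consumers a runtime error instead of a silent mismatch.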
### D6. Known naming caveat — `_measured` chart
`gemm_mac_utilization_measured.png` currently plots the *theoretical*
ideal-pipeline numbers (its footnote says so); only the filename says
"measured". This is a known misnomer pending a decision to either repoint
its content to the simulator-measured series or retitle it.
## Consequences
### Positive
- GEMM figures are test-generated and CI-guarded, like allreduce.
- The heavy sweep stays opt-in, keeping the default test run fast.
- Single source for the sweep logic (the script), reused by the slow test.
### Negative / limitations
- The theoretical-model constants (D5) are unverified and duplicated.
- The `_measured` figure is a misnomer (D6).
- `build_overview_slides.py` still renders the GEMM bars natively from
`gemm_sweep.json` rather than embedding these PNGs — the deck rewiring to
consume the test artifacts is not done.
## Dependencies
- **ADR-0013**: verification strategy.
- **ADR-0014 / ADR-0042**: PE pipeline + tile-plan generators — the GEMM
implementation the sweep measures; D4's stage record counts come from
ADR-0042 D2/D3.
- **ADR-0033**: latency model — the source the D5 constants should (but do
not yet) trace to.
- **ADR-0043**: the sibling allreduce evaluation harness.
## Open questions
- Reconcile D5 constants against `topology.yaml` / ADR-0033 / ADR-0014 and
de-duplicate (one source for the model parameters)?
- Resolve the D6 `_measured` naming (repoint content vs. retitle)?
- Rewire `build_overview_slides.py` to embed the `gemm_plots/` PNGs instead
of native bar-drawing?