Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command:
kernbench run --bench milestone-1h-gemm (MILESTONE_FAST=1 reuses JSON)
kernbench run --bench milestone-1h-ccl
- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
run(torch) entry drives the sweeps and writes figures into
benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
re-export/wrapper shims over the benches (single source preserved); the
pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).
ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated. Verified: milestone benches run
clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-0044: GEMM Evaluation Harness — scripts/gemm_sweep.py + tests/gemm/
Status
Accepted
Documents the GEMM evaluation/characterization harness; verified against the implementation (constants, tile sizes, figure set, and the script↔test split cross-checked). The D5/D6 caveats are recorded limitations, not inaccuracies.
Amended by ADR-0054: the sweep + renderers moved into the
milestone-1h-gemm bench (single home); scripts/gemm_sweep.py and
tests/gemm/ now re-export from it. D1/D2's "data generation stays a manual
script / heavy work is opt-in" is superseded by the eval-bench pattern (one
bench regenerates everything; MILESTONE_FAST=1 reuses the committed JSON).
Context
ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generators) define the GEMM implementation; ADR-0033 defines the latency model. None of them describe how GEMM performance is swept and characterized — the shape/variant sweep that produces the timing data, and the figures that interpret it. This ADR pins that harness.
Unlike the allreduce harness (ADR-0043), the GEMM sweep is heavy (24
sim runs: 8 shapes × 3 operand-staging variants; the 512 shape alone is
2048 tiles). That weight drives the split below.
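The run and tile counts quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the tile sizes recorded in D4 (32/64/32) and assuming partial tiles round up; `tile_count` is a hypothetical helper, not a function from the repository.

```python
from math import ceil

# Tile sizes as recorded in D4 (PeSchedulerComponent.TILE_M/K/N).
TILE_M, TILE_K, TILE_N = 32, 64, 32


def tile_count(m: int, k: int, n: int) -> int:
    """Number of (M, K, N) tiles for one GEMM shape.

    Assumption: a partial tile along any dimension counts as a full tile.
    """
    return ceil(m / TILE_M) * ceil(k / TILE_K) * ceil(n / TILE_N)


sweep_runs = 8 * 3  # 8 shapes x 3 operand-staging variants
tiles_512 = tile_count(512, 512, 512)  # 16 * 8 * 16

print(sweep_runs, tiles_512)  # 24 2048
```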
Decision
D1. Two-layer split — heavy data generation (script) vs. fast figures (tests)
- Data generation stays a manual script: scripts/gemm_sweep.py runs matmul-composite (ADR-0042 plans) across shapes × variants via the same run_bench path the CLI uses, harvests result.engine.op_log, and writes docs/diagrams/gemm_sweep.json (per-stage / per-engine wall-clock occupancy + record counts + pe/composite windows).
- Figure rendering is test-generated: tests/gemm/ reads the committed gemm_sweep.json and renders matplotlib PNGs into docs/diagrams/gemm_plots/. These tests are fast and run by default.
Rationale: a slide-deck-scale sim sweep does not belong in every pytest
run, but the figures (cheap, deterministic) should regenerate freely and be
guarded by CI. This mirrors CLAUDE.md's script-vs-test split (scripts for
heavy/manual generation; tests for fast assertions).
D2. Slow regenerator test wraps the script
tests/gemm/test_gemm_sweep.py is marked @pytest.mark.slow (excluded by
the default addopts: -m "not slow"). It invokes scripts/gemm_sweep.py
via subprocess to regenerate gemm_sweep.json on demand
(pytest -m slow tests/gemm/test_gemm_sweep.py). The sweep logic has a
single home (the script); the test only wraps it, so there is no duplicated
sim-driving code.
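The wrap-via-subprocess idea in D2 can be sketched as follows. This is an illustration under stated assumptions, not the contents of tests/gemm/test_gemm_sweep.py: `run_sweep_script` and its timeout are hypothetical, and the sketch assumes the script needs no command-line arguments.

```python
import subprocess
import sys
from pathlib import Path


def run_sweep_script(script: Path, timeout_s: float = 3600.0) -> subprocess.CompletedProcess:
    """Run a script in a child interpreter and raise if it exits non-zero.

    Hypothetical helper: the name and the timeout value are assumptions.
    """
    return subprocess.run(
        [sys.executable, str(script)],
        check=True,           # CalledProcessError on a non-zero exit code
        capture_output=True,  # keep stdout/stderr for the failure report
        text=True,
        timeout=timeout_s,
    )


# A slow-marked pytest wrapper around it might look roughly like this
# (left as a comment so the sketch stays stdlib-only):
#
#   @pytest.mark.slow
#   def test_gemm_sweep_regenerates_json():
#       run_sweep_script(Path("scripts/gemm_sweep.py"))
#       assert Path("docs/diagrams/gemm_sweep.json").is_file()
```

Because the test only shells out to the script, the sim-driving code lives in one place and the test body stays a few lines long.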
D3. Figure set (3 charts, load_ref variant)
| Test | PNG | Content |
|---|---|---|
| test_plot_gemm_stage_breakdown.py | gemm_stage_breakdown.png | per-stage engine wall-clock (DMA in / Fetch / GEMM / DMA out) |
| test_plot_gemm_mac_utilization.py | gemm_mac_utilization_measured.png | GEMM util % + useful eff % |
| test_plot_gemm_mac_utilization.py | gemm_mac_utilization_theoretical_vs_measured.png | theoretical vs simulator-measured util/eff |
tests/gemm/_gemm_plot_helpers.py holds the shared renderers (series logic
mirrors the GEMM _render_* functions in scripts/build_overview_slides.py,
which still draws these natively in the PPTX). The helper module is not
collected by pytest (no test_ prefix). Each test_plot_* test skips if
gemm_sweep.json is absent.
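The skip-when-absent behavior can be sketched with a small loader. This is a hedged illustration: `load_sweep_or_none` is a hypothetical name, and the real tests may check for the file differently.

```python
import json
from pathlib import Path
from typing import Optional


def load_sweep_or_none(path: Path) -> Optional[dict]:
    """Return the parsed sweep JSON, or None when the file is missing.

    Hypothetical helper: the name and return convention are assumptions.
    """
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


# Inside a test_plot_* test, the guard might look roughly like this:
#
#   sweep = load_sweep_or_none(Path("docs/diagrams/gemm_sweep.json"))
#   if sweep is None:
#       pytest.skip("gemm_sweep.json is absent; run the slow sweep first")
```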
D4. Tile sizes are data-driven; under-tile shapes are flagged
Tile sizes are read from gemm_sweep.json (tile_sizes), which the sweep
records from PeSchedulerComponent.TILE_M/K/N = 32/64/32 — the authoritative
source. Shapes with M<TILE_M ∨ K<TILE_K ∨ N<TILE_N are flagged
("under-tile") on the charts. The 512³ shape is excluded from the figures
(EXCLUDED_SHAPES).
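The under-tile predicate translates directly into code. This is a minimal sketch: `is_under_tile` and the tuple form of `EXCLUDED_SHAPES` are assumptions made for illustration, the example shapes are invented, and in the harness the tile sizes would come from the tile_sizes entry of gemm_sweep.json rather than a literal.

```python
from typing import Tuple

Shape = Tuple[int, int, int]  # (M, K, N)

# Tile sizes per D4; hard-coded here only to keep the sketch self-contained.
TILE_SIZES: Shape = (32, 64, 32)

# Assumed representation of the excluded 512^3 shape.
EXCLUDED_SHAPES = {(512, 512, 512)}


def is_under_tile(shape: Shape, tiles: Shape = TILE_SIZES) -> bool:
    """True when M < TILE_M or K < TILE_K or N < TILE_N."""
    m, k, n = shape
    tile_m, tile_k, tile_n = tiles
    return m < tile_m or k < tile_k or n < tile_n


# Hypothetical shapes, chosen only to exercise each branch.
shapes = [(16, 64, 32), (32, 64, 32), (128, 32, 128), (512, 512, 512)]
plotted = [s for s in shapes if s not in EXCLUDED_SHAPES]
flags = {s: is_under_tile(s) for s in plotted}

print(flags)
# {(16, 64, 32): True, (32, 64, 32): False, (128, 32, 128): True}
```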
D5. Theoretical model — inherited constants, NOT yet ADR-verified
The "theoretical" curves use an analytical ideal-pipeline model with
constants copied verbatim from scripts/build_overview_slides.py:
HBM_GBS = 256.0 # GB/s
T_STAGE = 16.0 # ns
D_STAGES = 3
BPE = 2
These are not yet sourced against the ADRs. Notably ADR-0033's 256
is burst_bytes (256 B), a different quantity than this 256 GB/s, and
ADR-0033 derives bandwidth as pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs.
T_STAGE/stage-count are not traced to ADR-0014 here. The model is
therefore consistent with the existing deck script, not verified against
the ADRs, and the constants are duplicated (deck + helper). Reconciling
them (source from topology/ADR-0033/0014, de-duplicate) is deferred — see
Open questions.
D6. Known naming caveat — _measured chart
gemm_mac_utilization_measured.png currently plots the theoretical
ideal-pipeline numbers (its footnote says so); only the filename says
"measured". This is a known misnomer pending a decision to either repoint
its content to the simulator-measured series or retitle it.
Consequences
Positive
- GEMM figures are test-generated and CI-guarded, like allreduce.
- The heavy sweep stays opt-in, keeping the default test run fast.
- Single source for the sweep logic (the script), reused by the slow test.
Negative / limitations
- The theoretical-model constants (D5) are unverified and duplicated.
- The _measured figure is a misnomer (D6).
- build_overview_slides.py still renders the GEMM bars natively from gemm_sweep.json rather than embedding these PNGs; the deck rewiring to consume the test artifacts is not done.
Dependencies
- ADR-0013: verification strategy.
- ADR-0014 / ADR-0042: PE pipeline + tile-plan generators — the GEMM implementation the sweep measures; D4's stage record counts come from ADR-0042 D2/D3.
- ADR-0033: latency model — the source the D5 constants should (but do not yet) trace to.
- ADR-0043: the sibling allreduce evaluation harness.
Open questions
- Reconcile D5 constants against topology.yaml / ADR-0033 / ADR-0014 and de-duplicate (one source for the model parameters)?
- Resolve the D6 _measured naming (repoint content vs. retitle)?
- Rewire build_overview_slides.py to embed the gemm_plots/ PNGs instead of native bar-drawing?