Document the allreduce + GEMM evaluation harnesses and bring the affected allreduce ADRs in line with the refactored code.

New (Accepted, EN + KO):
- ADR-0043, allreduce evaluation harness (tests/sccl/): distributed-driven correctness, latency/buffer-kind sweeps, sessionfinish plot aggregators, topology + FSIM-comparison figures. Verified against the implementation.
- ADR-0044, GEMM evaluation harness (scripts/gemm_sweep.py + tests/gemm/): heavy-script data gen vs. fast test-rendered figures, slow regenerator, the 3-figure set. Records two limitations as open questions: the theoretical-model constants are inherited (not yet traced to ADR-0033/0014), and the *_measured figure is a naming misnomer.

Updated (EN + KO):
- ADR-0024: add D5: SIP grid w/h resolution (explicit sips.w/h, square fallback, fail-loud), documenting the AhbmCCLBackend fix.
- ADR-0032: D4/D5/Non-goals reconciled: rectangular SIP grids (e.g. 6 SIPs as 3x2) are supported via explicit w/h; the square requirement now applies only to the fallback. Affected-files repointed to tests/sccl/.

Verification: ADR-0023 and ADR-0042 confirmed still matching the code (no change). verify_adr_lang_pairs.py passes (EN/KO Status blocks byte-equal).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ADR-0044: GEMM Evaluation Harness — scripts/gemm_sweep.py + tests/gemm/
Status
Accepted
Documents the GEMM evaluation/characterization harness; verified against the implementation (constants, tile sizes, figure set, and the script↔test split cross-checked). The D5/D6 caveats are recorded limitations, not inaccuracies.
Context
ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generators) define the GEMM implementation; ADR-0033 defines the latency model. None of them describe how GEMM performance is swept and characterized — the shape/variant sweep that produces the timing data, and the figures that interpret it. This ADR pins that harness.
Unlike the allreduce harness (ADR-0043), the GEMM sweep is heavy (24
sim runs: 8 shapes × 3 operand-staging variants; the 512³ shape alone is
2048 tiles, i.e. 16 × 8 × 16 at the 32/64/32 tile size). That weight drives
the split below.
Decision
D1. Two-layer split — heavy data generation (script) vs. fast figures (tests)
- Data generation stays a manual script:
  scripts/gemm_sweep.py runs matmul-composite (ADR-0042 plans) across shapes × variants via the same run_bench path the CLI uses, harvests result.engine.op_log, and writes docs/diagrams/gemm_sweep.json (per-stage / per-engine wall-clock occupancy + record counts + pe/composite windows).
- Figure rendering is test-generated:
  tests/gemm/ reads the committed gemm_sweep.json and renders matplotlib PNGs into docs/diagrams/gemm_plots/. These tests are fast and run by default.
Rationale: a slide-deck-scale sim sweep does not belong in every pytest
run, but the figures (cheap, deterministic) should regenerate freely and be
guarded by CI. This mirrors CLAUDE.md's script-vs-test split (scripts for
heavy/manual generation; tests for fast assertions).
D2. Slow regenerator test wraps the script
tests/gemm/test_gemm_sweep.py is marked @pytest.mark.slow (excluded by
the default addopts: -m "not slow"). It invokes scripts/gemm_sweep.py
via subprocess to regenerate gemm_sweep.json on demand
(pytest -m slow tests/gemm/test_gemm_sweep.py). The sweep logic has a
single home (the script); the test only wraps it, so there is no duplicated
sim-driving code.
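The wrapper pattern described in D2 can be sketched as follows. This is a minimal illustration, not the contents of tests/gemm/test_gemm_sweep.py: the helper name run_sweep_script and the SWEEP_SCRIPT default are hypothetical, and the pytest decorator is shown only in a comment so the snippet needs nothing beyond the standard library.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional

# Hypothetical default: the ADR names scripts/gemm_sweep.py relative to the repo root.
SWEEP_SCRIPT = Path("scripts") / "gemm_sweep.py"


def run_sweep_script(
    script: Path = SWEEP_SCRIPT, timeout_s: Optional[float] = None
) -> subprocess.CompletedProcess:
    """Run the sweep script in a child interpreter.

    check=True raises CalledProcessError on a non-zero exit, so a failed
    sweep fails the wrapping test instead of leaving stale data behind.
    """
    return subprocess.run(
        [sys.executable, str(script)],
        check=True,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )


# In a test module the call would sit inside a slow-marked test, e.g.:
#
#   @pytest.mark.slow
#   def test_regenerate_gemm_sweep():
#       run_sweep_script()
```

Because the test only shells out to the script, any change to the sweep logic is made in one place and the slow test picks it up automatically.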
D3. Figure set (3 charts, load_ref variant)
| Test | PNG | Content |
|---|---|---|
| test_plot_gemm_stage_breakdown.py | gemm_stage_breakdown.png | per-stage engine wall-clock (DMA in / Fetch / GEMM / DMA out) |
| test_plot_gemm_mac_utilization.py | gemm_mac_utilization_measured.png | GEMM util % + useful eff % |
| test_plot_gemm_mac_utilization.py | gemm_mac_utilization_theoretical_vs_measured.png | theoretical vs simulator-measured util/eff |
tests/gemm/_gemm_plot_helpers.py holds the shared renderers (series logic
mirrors the GEMM _render_* functions in scripts/build_overview_slides.py,
which still draws these natively in the PPTX). The helper module is not
collected by pytest (no test_ prefix). Each test_plot_* test skips if
gemm_sweep.json is absent.
D4. Tile sizes are data-driven; under-tile shapes are flagged
Tile sizes are read from gemm_sweep.json (tile_sizes), which the sweep
records from PeSchedulerComponent.TILE_M/K/N = 32/64/32 — the authoritative
source. Shapes with M<TILE_M ∨ K<TILE_K ∨ N<TILE_N are flagged
("under-tile") on the charts. The 512³ shape is excluded from the figures
(EXCLUDED_SHAPES).
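The under-tile rule in D4 can be illustrated with a short sketch. The function names are hypothetical, the inner layout of tile_sizes (a mapping keyed "M"/"K"/"N") is an assumption about the JSON, and the ceil-division tile count is an assumed tiling rule that happens to reproduce the 2048-tile figure quoted for the 512³ shape.

```python
import math
from typing import Mapping, Tuple

Tiles = Tuple[int, int, int]


def tile_sizes_from_sweep(sweep: Mapping[str, Mapping[str, int]]) -> Tiles:
    """Read (TILE_M, TILE_K, TILE_N) from the loaded sweep JSON.

    ASSUMPTION: tile_sizes is a mapping keyed "M", "K", "N". A missing key
    raises KeyError rather than silently falling back to hard-coded values.
    """
    t = sweep["tile_sizes"]
    return int(t["M"]), int(t["K"]), int(t["N"])


def is_under_tile(m: int, k: int, n: int, tiles: Tiles) -> bool:
    """True when any dimension is smaller than its tile size (D4's flag)."""
    tile_m, tile_k, tile_n = tiles
    return m < tile_m or k < tile_k or n < tile_n


def tile_count(m: int, k: int, n: int, tiles: Tiles) -> int:
    """ASSUMPTION: one tile per ceil(dim / tile) step in each dimension."""
    tile_m, tile_k, tile_n = tiles
    return math.ceil(m / tile_m) * math.ceil(k / tile_k) * math.ceil(n / tile_n)


# Example with the 32/64/32 tile sizes quoted in D4.
tiles = tile_sizes_from_sweep({"tile_sizes": {"M": 32, "K": 64, "N": 32}})
print(is_under_tile(16, 64, 32, tiles))   # M=16 < TILE_M=32, so flagged
print(tile_count(512, 512, 512, tiles))   # 16 * 8 * 16
```

Reading the tile sizes from the JSON rather than hard-coding them keeps the charts consistent with whatever values the sweep actually recorded.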
D5. Theoretical model — inherited constants, NOT yet ADR-verified
The "theoretical" curves use an analytical ideal-pipeline model with
constants copied verbatim from scripts/build_overview_slides.py:
    HBM_GBS  = 256.0   # GB/s
    T_STAGE  = 16.0    # ns
    D_STAGES = 3
    BPE      = 2
These are not yet sourced against the ADRs. Notably ADR-0033's 256
is burst_bytes (256 B), a different quantity than this 256 GB/s, and
ADR-0033 derives bandwidth as pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs.
T_STAGE/stage-count are not traced to ADR-0014 here. The model is
therefore consistent with the existing deck script, not verified against
the ADRs, and the constants are duplicated (deck + helper). Reconciling
them (source from topology/ADR-0033/0014, de-duplicate) is deferred — see
Open questions.
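To make the unit mismatch concrete, the sketch below contrasts the two "256" values and shows the per-channel bandwidth derivation quoted from ADR-0033. It is illustrative only: the function names are hypothetical, GB is taken as 10^9 bytes (an assumption), and the inputs passed to pc_bw_gbs in the example are made-up numbers, not values from topology.yaml.

```python
BURST_BYTES = 256   # ADR-0033's 256: a burst size in bytes
HBM_GBS = 256.0     # D5's 256: a bandwidth in GB/s (copied from the deck script)


def pc_bw_gbs(hbm_to_router_bw_gbs: float, num_pcs: int) -> float:
    """Per-channel bandwidth as ADR-0033 derives it: total bandwidth / channel count."""
    if num_pcs <= 0:
        raise ValueError("num_pcs must be positive")
    return hbm_to_router_bw_gbs / num_pcs


def transfer_time_ns(num_bytes: float, bw_gbs: float) -> float:
    """Time to move num_bytes at bw_gbs.

    ASSUMPTION: 1 GB = 1e9 bytes, so 1 GB/s equals 1 byte per nanosecond
    and the division yields nanoseconds directly.
    """
    return num_bytes / bw_gbs


# One 256 B burst at 256 GB/s takes 1 ns: the two 256s are different
# quantities that only coincide numerically.
print(transfer_time_ns(BURST_BYTES, HBM_GBS))

# Hypothetical inputs, for illustration only.
print(pc_bw_gbs(512.0, 8))
```

Sourcing HBM_GBS from a derivation like pc_bw_gbs, instead of a copied literal, is one way the reconciliation deferred to Open questions could look.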
D6. Known naming caveat — _measured chart
gemm_mac_utilization_measured.png currently plots the theoretical
ideal-pipeline numbers (its footnote says so); only the filename says
"measured". This is a known misnomer pending a decision to either repoint
its content to the simulator-measured series or retitle it.
Consequences
Positive
- GEMM figures are test-generated and CI-guarded, like allreduce.
- The heavy sweep stays opt-in, keeping the default test run fast.
- Single source for the sweep logic (the script), reused by the slow test.
Negative / limitations
- The theoretical-model constants (D5) are unverified and duplicated.
- The _measured figure is a misnomer (D6).
- build_overview_slides.py still renders the GEMM bars natively from gemm_sweep.json rather than embedding these PNGs; the deck rewiring to consume the test artifacts is not done.
Dependencies
- ADR-0013: verification strategy.
- ADR-0014 / ADR-0042: PE pipeline + tile-plan generators — the GEMM implementation the sweep measures; D4's stage record counts come from ADR-0042 D2/D3.
- ADR-0033: latency model — the source the D5 constants should (but do not yet) trace to.
- ADR-0043: the sibling allreduce evaluation harness.
Open questions
- Reconcile D5 constants against topology.yaml / ADR-0033 / ADR-0014 and de-duplicate (one source for the model parameters)?
- Resolve the D6 _measured naming (repoint content vs. retitle)?
- Rewire build_overview_slides.py to embed the gemm_plots/ PNGs instead of native bar-drawing?