kernbench2/docs/adr/ADR-0044-eval-gemm-harness.md
mukesh fd56b6cacd adr: add ADR-0043/0044 (eval harnesses); reconcile ADR-0024/0032 for SIP w/h
Document the allreduce + GEMM evaluation harnesses and bring the affected
allreduce ADRs in line with the refactored code.

New (Accepted, EN + KO):
- ADR-0043 — allreduce evaluation harness (tests/sccl/): distributed-driven
  correctness, latency/buffer-kind sweeps, sessionfinish plot aggregators,
  topology + FSIM-comparison figures. Verified against the implementation.
- ADR-0044 — GEMM evaluation harness (scripts/gemm_sweep.py + tests/gemm/):
  heavy-script data gen vs. fast test-rendered figures, slow regenerator,
  the 3-figure set. Records two limitations as open questions: the
  theoretical-model constants are inherited (not yet traced to ADR-0033/
  0014), and the *_measured figure is a naming misnomer.

Updated (EN + KO):
- ADR-0024 — add D5: SIP grid w/h resolution (explicit sips.w/h, square
  fallback, fail-loud), documenting the AhbmCCLBackend fix.
- ADR-0032 — D4/D5/Non-goals reconciled: rectangular SIP grids (e.g. 6 SIPs
  as 3x2) are supported via explicit w/h; the square requirement now
  applies only to the fallback. Affected-files repointed to tests/sccl/.

Verification: ADR-0023 and ADR-0042 confirmed still matching the code (no
change). verify_adr_lang_pairs.py passes (EN/KO Status blocks byte-equal).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 10:26:25 -07:00


ADR-0044: GEMM Evaluation Harness — scripts/gemm_sweep.py + tests/gemm/

Status

Accepted

Documents the GEMM evaluation/characterization harness; verified against the implementation (constants, tile sizes, figure set, and the script↔test split cross-checked). The D5/D6 caveats are recorded limitations, not inaccuracies.

Context

ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generators) define the GEMM implementation; ADR-0033 defines the latency model. None of them describe how GEMM performance is swept and characterized — the shape/variant sweep that produces the timing data, and the figures that interpret it. This ADR pins that harness.

Unlike the allreduce harness (ADR-0043), the GEMM sweep is heavy (24 sim runs: 8 shapes × 3 operand-staging variants; the 512 shape alone is 2048 tiles). That weight drives the split below.
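The sweep size quoted above can be checked with a few lines of arithmetic. This is a sketch that assumes the 32/64/32 tile sizes recorded in D4 and one tile per (M, K, N) block; `tile_count` is a hypothetical helper, not a function from the repository.

```python
from math import ceil

# Tile sizes as recorded in D4 (TILE_M / TILE_K / TILE_N).
TILE_M, TILE_K, TILE_N = 32, 64, 32


def tile_count(m: int, k: int, n: int) -> int:
    """Number of (M, K, N) tiles, rounding partial tiles up."""
    return ceil(m / TILE_M) * ceil(k / TILE_K) * ceil(n / TILE_N)


SIM_RUNS = 8 * 3  # 8 shapes x 3 operand-staging variants

print(SIM_RUNS)                   # 24
print(tile_count(512, 512, 512))  # 2048
```

The 512 case works out to 16 × 8 × 16 tiles, which matches the 2048 figure.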

Decision

D1. Two-layer split — heavy data generation (script) vs. fast figures (tests)

  • Data generation stays a manual script: scripts/gemm_sweep.py runs matmul-composite (ADR-0042 plans) across shapes × variants via the same run_bench path the CLI uses, harvests result.engine.op_log, and writes docs/diagrams/gemm_sweep.json (per-stage / per-engine wall-clock + occupancy + record counts + pe/composite windows).
  • Figure rendering is test-generated: tests/gemm/ reads the committed gemm_sweep.json and renders matplotlib PNGs into docs/diagrams/gemm_plots/. These tests are fast and run by default.

Rationale: a slide-deck-scale sim sweep does not belong in every pytest run, but the figures (cheap, deterministic) should regenerate freely and be guarded by CI. This mirrors CLAUDE.md's script-vs-test split (scripts for heavy/manual generation; tests for fast assertions).

D2. Slow regenerator test wraps the script

tests/gemm/test_gemm_sweep.py is marked @pytest.mark.slow (excluded by the default addopts: -m "not slow"). It invokes scripts/gemm_sweep.py via subprocess to regenerate gemm_sweep.json on demand (pytest -m slow tests/gemm/test_gemm_sweep.py). The sweep logic has a single home (the script); the test only wraps it, so there is no duplicated sim-driving code.
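A minimal sketch of such a wrapper, assuming the script needs no required arguments and is run from the repository root. `regenerate_sweep` is a hypothetical name, not the function in the repository; in the real test the call would sit inside a function decorated with @pytest.mark.slow.

```python
import subprocess
import sys
from pathlib import Path


def regenerate_sweep(script: Path, cwd: Path) -> subprocess.CompletedProcess:
    """Run the sweep script in a child interpreter.

    check=True raises CalledProcessError on a non-zero exit, so a failed
    sweep fails the wrapping test instead of leaving stale JSON behind.
    """
    return subprocess.run(
        [sys.executable, str(script)],
        cwd=str(cwd),
        check=True,
        capture_output=True,
        text=True,
    )
```

Because the wrapper only launches the script, any change to shapes or variants is made in one place and the test picks it up unchanged.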

D3. Figure set (3 charts, load_ref variant)

| Test | PNG | Content |
|---|---|---|
| test_plot_gemm_stage_breakdown.py | gemm_stage_breakdown.png | per-stage engine wall-clock (DMA in / Fetch / GEMM / DMA out) |
| test_plot_gemm_mac_utilization.py | gemm_mac_utilization_measured.png | GEMM util % + useful eff % |
| test_plot_gemm_mac_utilization.py | gemm_mac_utilization_theoretical_vs_measured.png | theoretical vs simulator-measured util/eff |

tests/gemm/_gemm_plot_helpers.py holds the shared renderers; its series logic mirrors the GEMM _render_* functions in scripts/build_overview_slides.py, which still draws these charts natively in the PPTX. The helper module is not collected by pytest (no test_ prefix). Each test_plot_* test skips if gemm_sweep.json is absent.
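The skip-if-absent behaviour can be sketched as a small loader. `load_sweep` is a hypothetical helper, and the real tests may structure this differently.

```python
import json
from pathlib import Path
from typing import Optional


def load_sweep(path: Path) -> Optional[dict]:
    """Return the parsed sweep data, or None when the file is absent."""
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

A plot test could call `pytest.skip("gemm_sweep.json not found")` when the loader returns None, so a checkout without the committed JSON reports a skip rather than a failure.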

D4. Tile sizes are data-driven; under-tile shapes are flagged

Tile sizes are read from gemm_sweep.json (tile_sizes), which the sweep records from PeSchedulerComponent.TILE_M/K/N = 32/64/32, the authoritative source. Shapes with a dimension below the tile size (M<TILE_M, K<TILE_K, N<TILE_N) are flagged ("under-tile") on the charts. The 512³ shape is excluded from the figures (EXCLUDED_SHAPES).
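One way to express the under-tile check is sketched below. It rests on two assumptions: that the tile_sizes entry is a mapping keyed by "M", "K", "N", and that a shape is flagged when any single dimension is below its tile size. `under_tile_dims` is a hypothetical helper.

```python
from typing import List, Mapping


def under_tile_dims(m: int, k: int, n: int,
                    tile_sizes: Mapping[str, int]) -> List[str]:
    """Names of the dimensions that are smaller than their tile size."""
    dims = {"M": m, "K": k, "N": n}
    return [name for name, size in dims.items() if size < tile_sizes[name]]


# Assumed layout of the tile_sizes entry; values are the ones stated in D4.
tile_sizes = {"M": 32, "K": 64, "N": 32}

print(under_tile_dims(32, 32, 32, tile_sizes))  # ['K']
print(under_tile_dims(64, 64, 64, tile_sizes))  # []
```

Returning the offending dimension names rather than a bare boolean would let a chart annotation say which axis is under-tile.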

D5. Theoretical model — inherited constants, NOT yet ADR-verified

The "theoretical" curves use an analytical ideal-pipeline model with constants copied verbatim from scripts/build_overview_slides.py:

HBM_GBS  = 256.0  # GB/s
T_STAGE  = 16.0   # ns
D_STAGES = 3
BPE      = 2

These are not yet sourced against the ADRs. Notably ADR-0033's 256 is burst_bytes (256 B), a different quantity than this 256 GB/s, and ADR-0033 derives bandwidth as pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs. T_STAGE/stage-count are not traced to ADR-0014 here. The model is therefore consistent with the existing deck script, not verified against the ADRs, and the constants are duplicated (deck + helper). Reconciling them (source from topology/ADR-0033/0014, de-duplicate) is deferred — see Open questions.
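The unit mismatch is easy to see in code. The sketch below is not the harness's theoretical model; it only contrasts a byte count with a rate and applies the ADR-0033 formula quoted above. The inputs passed to `pc_bw_gbs` are hypothetical values chosen to exercise the formula.

```python
def transfer_ns(num_bytes: float, bw_gbs: float) -> float:
    """Time in ns to move num_bytes at bw_gbs (1 GB/s = 1 byte/ns, decimal GB)."""
    return num_bytes / bw_gbs


def pc_bw_gbs(hbm_to_router_bw_gbs: float, num_pcs: int) -> float:
    """Per-PC bandwidth, using the ADR-0033 derivation quoted in D5."""
    return hbm_to_router_bw_gbs / num_pcs


BURST_BYTES = 256  # ADR-0033: a size, in bytes
HBM_GBS = 256.0    # this ADR's model: a rate, in GB/s

print(transfer_ns(BURST_BYTES, HBM_GBS))  # 1.0

# Hypothetical inputs, not taken from topology.yaml or ADR-0033:
print(pc_bw_gbs(512.0, 8))  # 64.0
```

The two 256 values only coincide numerically: one is the numerator and the other the denominator of the same transfer-time expression.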

D6. Known naming caveat — _measured chart

gemm_mac_utilization_measured.png currently plots the theoretical ideal-pipeline numbers, as its footnote states; only the filename says "measured". This is a known misnomer pending a decision to either repoint its content to the simulator-measured series or retitle it.

Consequences

Positive

  • GEMM figures are test-generated and CI-guarded, like allreduce.
  • The heavy sweep stays opt-in, keeping the default test run fast.
  • Single source for the sweep logic (the script), reused by the slow test.

Negative / limitations

  • The theoretical-model constants (D5) are unverified and duplicated.
  • The _measured figure is a misnomer (D6).
  • build_overview_slides.py still renders the GEMM bars natively from gemm_sweep.json rather than embedding these PNGs — the deck rewiring to consume the test artifacts is not done.

Dependencies

  • ADR-0013: verification strategy.
  • ADR-0014 / ADR-0042: PE pipeline + tile-plan generators — the GEMM implementation the sweep measures; D4's stage record counts come from ADR-0042 D2/D3.
  • ADR-0033: latency model — the source the D5 constants should (but do not yet) trace to.
  • ADR-0043: the sibling allreduce evaluation harness.

Open questions

  • Reconcile D5 constants against topology.yaml / ADR-0033 / ADR-0014 and de-duplicate (one source for the model parameters)?
  • Resolve the D6 _measured naming (repoint content vs. retitle)?
  • Rewire build_overview_slides.py to embed the gemm_plots/ PNGs instead of native bar-drawing?