kernbench2/docs/adr/ADR-0044-eval-gemm-harness.md
mukesh cc1bbd0ab7 eval: fold GEMM/allreduce harnesses into self-contained milestone benches
Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command:

  kernbench run --bench milestone-1h-gemm   (MILESTONE_FAST=1 reuses JSON)
  kernbench run --bench milestone-1h-ccl

- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
  run(torch) entry drives the sweeps and writes figures into
  benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
  sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
  re-export/wrapper shims over the benches (single source preserved); the
  pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
  per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).

ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated. Verified: milestone benches run
clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:19:52 -07:00

ADR-0044: GEMM Evaluation Harness — scripts/gemm_sweep.py + tests/gemm/

Status

Accepted

Documents the GEMM evaluation/characterization harness; verified against the implementation (constants, tile sizes, figure set, and the script↔test split cross-checked). The D5/D6 caveats are recorded limitations, not inaccuracies.

Amended by ADR-0054: the sweep + renderers moved into the milestone-1h-gemm bench (single home); scripts/gemm_sweep.py and tests/gemm/ now re-export from it. D1/D2's "data generation stays a manual script / heavy work is opt-in" is superseded by the eval-bench pattern (one bench regenerates everything; MILESTONE_FAST=1 reuses the committed JSON).

Context

ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generators) define the GEMM implementation; ADR-0033 defines the latency model. None of them describe how GEMM performance is swept and characterized — the shape/variant sweep that produces the timing data, and the figures that interpret it. This ADR pins that harness.

Unlike the allreduce harness (ADR-0043), the GEMM sweep is heavy (24 sim runs: 8 shapes × 3 operand-staging variants; the 512³ shape alone is 2048 tiles). That weight drives the split below.
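The run and tile counts above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the tile count is the product of the per-dimension tile counts and that partial tiles round up; `tile_count` is a hypothetical helper, not a function from the repository. The tile sizes are the ones recorded in D4.

```python
import math

# Tile sizes as recorded in D4 (PeSchedulerComponent.TILE_M/K/N = 32/64/32).
TILE_M, TILE_K, TILE_N = 32, 64, 32


def tile_count(m, k, n):
    """Hypothetical helper: tiles needed for an (M, K, N) GEMM.

    Assumption: partial tiles round up to a whole tile in each dimension.
    """
    return (
        math.ceil(m / TILE_M)
        * math.ceil(k / TILE_K)
        * math.ceil(n / TILE_N)
    )


SHAPES = 8    # shapes in the sweep
VARIANTS = 3  # operand-staging variants

print(SHAPES * VARIANTS)          # 24 sim runs
print(tile_count(512, 512, 512))  # 16 * 8 * 16 = 2048 tiles
```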

Decision

D1. Two-layer split — heavy data generation (script) vs. fast figures (tests)

  • Data generation stays a manual script: scripts/gemm_sweep.py runs matmul-composite (ADR-0042 plans) across shapes × variants via the same run_bench path the CLI uses, harvests result.engine.op_log, and writes docs/diagrams/gemm_sweep.json (per-stage / per-engine wall-clock + occupancy + record counts + pe/composite windows).
  • Figure rendering is test-generated: tests/gemm/ reads the committed gemm_sweep.json and renders matplotlib PNGs into docs/diagrams/gemm_plots/. These tests are fast and run by default.

Rationale: a slide-deck-scale sim sweep does not belong in every pytest run, but the figures (cheap, deterministic) should regenerate freely and be guarded by CI. This mirrors CLAUDE.md's script-vs-test split (scripts for heavy/manual generation; tests for fast assertions).

D2. Slow regenerator test wraps the script

tests/gemm/test_gemm_sweep.py is marked @pytest.mark.slow (excluded by the default addopts: -m "not slow"). It invokes scripts/gemm_sweep.py via subprocess to regenerate gemm_sweep.json on demand (pytest -m slow tests/gemm/test_gemm_sweep.py). The sweep logic has a single home (the script); the test only wraps it, so there is no duplicated sim-driving code.
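The wrapper described in D2 might look roughly like the following. This is a sketch under stated assumptions, not the repository's actual test: `sweep_command` and `regenerate_sweep` are hypothetical names, and the script is assumed to take no arguments. The `run` parameter exists only so the wrapper can be exercised without launching the real sweep.

```python
import subprocess
import sys
from pathlib import Path

# Script path as named in this ADR, relative to the repository root.
SWEEP_SCRIPT = Path("scripts") / "gemm_sweep.py"


def sweep_command(script=SWEEP_SCRIPT):
    """Command line that runs the sweep script with the current interpreter."""
    return [sys.executable, str(script)]


def regenerate_sweep(run=subprocess.run):
    """Run the sweep script; with check=True a non-zero exit raises an error."""
    return run(sweep_command(), check=True)


# In tests/gemm/test_gemm_sweep.py the wrapper would carry the slow marker,
# so the default `-m "not slow"` selection leaves it out:
#
#   @pytest.mark.slow
#   def test_gemm_sweep_regenerates_json():
#       regenerate_sweep()
```

Because the test only shells out to the script, the sim-driving code is not duplicated in the test module.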

D3. Figure set (3 charts, load_ref variant)

| Test | PNG | Content |
| --- | --- | --- |
| test_plot_gemm_stage_breakdown.py | gemm_stage_breakdown.png | per-stage engine wall-clock (DMA in / Fetch / GEMM / DMA out) |
| test_plot_gemm_mac_utilization.py | gemm_mac_utilization_measured.png | GEMM util % + useful eff % |
| test_plot_gemm_mac_utilization.py | gemm_mac_utilization_theoretical_vs_measured.png | theoretical vs simulator-measured util/eff |

tests/gemm/_gemm_plot_helpers.py holds the shared renderers (series logic mirrors the GEMM _render_* functions in scripts/build_overview_slides.py, which still draws these natively in the PPTX). Not collected (no test_ prefix). Each test_plot_* skips if gemm_sweep.json is absent.
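The skip-if-absent behaviour can be sketched as a loader that reports a missing file instead of failing. This is an illustration only: `load_sweep_or_none` is a hypothetical name, and the real helpers may structure this differently.

```python
import json
from pathlib import Path


def load_sweep_or_none(path):
    """Return the parsed sweep JSON, or None when the file does not exist."""
    p = Path(path)
    if not p.is_file():
        return None
    with p.open(encoding="utf-8") as fh:
        return json.load(fh)


# A test_plot_* test could then skip rather than fail:
#
#   data = load_sweep_or_none("docs/diagrams/gemm_sweep.json")
#   if data is None:
#       pytest.skip("gemm_sweep.json is absent")
```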

D4. Tile sizes are data-driven; under-tile shapes are flagged

Tile sizes are read from gemm_sweep.json (tile_sizes), which the sweep records from PeSchedulerComponent.TILE_M/K/N = 32/64/32, the authoritative source. Shapes with M<TILE_M / K<TILE_K / N<TILE_N are flagged ("under-tile") on the charts. The 512³ shape is excluded from the figures (EXCLUDED_SHAPES).
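A minimal sketch of the flagging and exclusion logic follows. It assumes a shape counts as under-tile when any one dimension is smaller than its tile size, and that EXCLUDED_SHAPES holds (M, K, N) tuples. Both are assumptions, as is the helper name `under_tile_dims`; the helpers in tests/gemm/ may differ.

```python
# Tile sizes as they would be read from gemm_sweep.json (tile_sizes).
TILE_SIZES = {"M": 32, "K": 64, "N": 32}

# Assumed representation: a set of (M, K, N) tuples.
EXCLUDED_SHAPES = {(512, 512, 512)}


def under_tile_dims(m, k, n, tiles=TILE_SIZES):
    """Names of the dimensions that are smaller than their tile size."""
    dims = {"M": m, "K": k, "N": n}
    return [name for name, size in dims.items() if size < tiles[name]]


def is_charted(shape):
    """True when the shape is not in the exclusion set."""
    return shape not in EXCLUDED_SHAPES


print(under_tile_dims(32, 32, 32))   # ['K'], since K=32 is below TILE_K=64
print(is_charted((512, 512, 512)))   # False
```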

D5. Theoretical model — inherited constants, NOT yet ADR-verified

The "theoretical" curves use an analytical ideal-pipeline model with constants copied verbatim from scripts/build_overview_slides.py:

HBM_GBS = 256.0   # GB/s
T_STAGE = 16.0    # ns
D_STAGES = 3
BPE = 2

These are not yet sourced against the ADRs. Notably ADR-0033's 256 is burst_bytes (256 B), a different quantity than this 256 GB/s, and ADR-0033 derives bandwidth as pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs. T_STAGE/stage-count are not traced to ADR-0014 here. The model is therefore consistent with the existing deck script, not verified against the ADRs, and the constants are duplicated (deck + helper). Reconciling them (source from topology/ADR-0033/0014, de-duplicate) is deferred — see Open questions.
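One way to remove the duplication is a single frozen parameter object that both the deck script and the plot helper import. The sketch below is illustrative only: `IdealPipelineParams`, `pc_bw_gbs`, and `transfer_ns` are hypothetical names, the values are the unverified ones above, and 1 GB is assumed to be 10^9 bytes. It also shows why the two 256s are different quantities: a 256 B burst at 256 GB/s is a 1 ns transfer, not a bandwidth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdealPipelineParams:
    """Hypothetical single home for the D5 constants (values unverified)."""
    hbm_gbs: float = 256.0    # GB/s
    t_stage_ns: float = 16.0  # ns
    d_stages: int = 3
    bpe: int = 2


PARAMS = IdealPipelineParams()


def pc_bw_gbs(hbm_to_router_bw_gbs, num_pcs):
    """Bandwidth as ADR-0033 derives it: total bandwidth split across PCs."""
    return hbm_to_router_bw_gbs / num_pcs


def transfer_ns(num_bytes, bw_gbs):
    """Transfer time in ns; assumes 1 GB/s == 1 byte/ns (GB = 10^9 bytes)."""
    return num_bytes / bw_gbs


# burst_bytes (256 B) moved at 256 GB/s takes 1 ns.
print(transfer_ns(256, PARAMS.hbm_gbs))  # 1.0
```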

D6. Known naming caveat — _measured chart

gemm_mac_utilization_measured.png currently plots the theoretical ideal-pipeline numbers (its footnote says so); only the filename says "measured". This is a known misnomer, pending a decision to either repoint its content to the simulator-measured series or retitle it.

Consequences

Positive

  • GEMM figures are test-generated and CI-guarded, like allreduce.
  • The heavy sweep stays opt-in, keeping the default test run fast.
  • Single source for the sweep logic (the script), reused by the slow test.

Negative / limitations

  • The theoretical-model constants (D5) are unverified and duplicated.
  • The _measured figure is a misnomer (D6).
  • build_overview_slides.py still renders the GEMM bars natively from gemm_sweep.json rather than embedding these PNGs — the deck rewiring to consume the test artifacts is not done.

Dependencies

  • ADR-0013: verification strategy.
  • ADR-0014 / ADR-0042: PE pipeline + tile-plan generators — the GEMM implementation the sweep measures; D4's stage record counts come from ADR-0042 D2/D3.
  • ADR-0033: latency model — the source the D5 constants should (but do not yet) trace to.
  • ADR-0043: the sibling allreduce evaluation harness.

Open questions

  • Reconcile D5 constants against topology.yaml / ADR-0033 / ADR-0014 and de-duplicate (one source for the model parameters)?
  • Resolve the D6 _measured naming (repoint content vs. retitle)?
  • Rewire build_overview_slides.py to embed the gemm_plots/ PNGs instead of native bar-drawing?