# ADR-0044: GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`

## Status

Accepted

Documents the GEMM evaluation/characterization harness; verified against the implementation (constants, tile sizes, figure set, and the script↔test split cross-checked). The D5/D6 caveats are recorded limitations, not inaccuracies.

**Amended by ADR-0054**: the sweep + renderers moved into the `milestone-1h-gemm` bench (single home); `scripts/gemm_sweep.py` and `tests/gemm/` now re-export from it. D1/D2's "data generation stays a manual script / heavy work is opt-in" is superseded by the eval-bench pattern (one bench regenerates everything; `MILESTONE_FAST=1` reuses the committed JSON).

## Context

ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generators) define the GEMM *implementation*; ADR-0033 defines the latency model. None of them describe **how GEMM performance is swept and characterized** — the shape/variant sweep that produces the timing data, and the figures that interpret it. This ADR pins that harness.

Unlike the allreduce harness (ADR-0043), the GEMM sweep is **heavy** (24 sim runs: 8 shapes × 3 operand-staging variants; the `512` shape alone is 2048 tiles). That weight drives the split below.

## Decision

### D1. Two-layer split — heavy data generation (script) vs. fast figures (tests)

- **Data generation stays a manual script**: `scripts/gemm_sweep.py` runs `matmul-composite` (ADR-0042 plans) across shapes × variants via the same `run_bench` path the CLI uses, harvests `result.engine.op_log`, and writes `docs/diagrams/gemm_sweep.json` (per-stage / per-engine wall-clock + occupancy + record counts + pe/composite windows).
- **Figure rendering is test-generated**: `tests/gemm/` reads the committed `gemm_sweep.json` and renders matplotlib PNGs into `docs/diagrams/gemm_plots/`. These tests are fast and run by default.

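As a rough check of the sweep weight that motivates this split, the sketch below reproduces the run count and the `512` tile count from numbers stated in this ADR. It assumes that the `512` shape means M = K = N = 512 and that the tile count is the product of per-dimension ceiling divisions by `TILE_M/K/N = 32/64/32`. `tile_count` and `sweep_runs` are hypothetical helper names for illustration, not part of the harness.

```python
# Tile sizes as recorded in this ADR (PeSchedulerComponent.TILE_M/K/N = 32/64/32).
TILE_M, TILE_K, TILE_N = 32, 64, 32


def ceil_div(value: int, tile: int) -> int:
    """Integer ceiling division: tiles needed to cover `value` elements."""
    return (value + tile - 1) // tile


def tile_count(m: int, k: int, n: int) -> int:
    """Tiles for an (M, K, N) GEMM, assuming independent tiling per dimension."""
    return ceil_div(m, TILE_M) * ceil_div(k, TILE_K) * ceil_div(n, TILE_N)


def sweep_runs(num_shapes: int, num_variants: int) -> int:
    """One sim run per (shape, variant) pair."""
    return num_shapes * num_variants


# Assumption: the `512` shape is M = K = N = 512.
print(tile_count(512, 512, 512))  # 16 * 8 * 16 = 2048
print(sweep_runs(8, 3))           # 8 shapes x 3 variants = 24
```

Under these assumptions the result matches the 2048 tiles and 24 runs quoted in the Context, which is consistent with the tiling being per-dimension ceiling division.
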

Rationale: a slide-deck-scale sim sweep does not belong in every `pytest` run, but the figures (cheap, deterministic) should regenerate freely and be guarded by CI. This mirrors CLAUDE.md's script-vs-test split (scripts for heavy/manual generation; tests for fast assertions).

### D2. Slow regenerator test wraps the script

`tests/gemm/test_gemm_sweep.py` is marked `@pytest.mark.slow` (excluded by the default `addopts: -m "not slow"`). It invokes `scripts/gemm_sweep.py` via subprocess to regenerate `gemm_sweep.json` on demand (`pytest -m slow tests/gemm/test_gemm_sweep.py`). The sweep logic has a single home (the script); the test only wraps it, so there is no duplicated sim-driving code.

### D3. Figure set (3 charts, `load_ref` variant)

| Test | PNG | Content |
|---|---|---|
| `test_plot_gemm_stage_breakdown.py` | `gemm_stage_breakdown.png` | per-stage engine wall-clock (DMA in / Fetch / GEMM / DMA out) |
| `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_measured.png` | GEMM util % + useful eff % |
| `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_theoretical_vs_measured.png` | theoretical vs simulator-measured util/eff |

`tests/gemm/_gemm_plot_helpers.py` holds the shared renderers (series logic mirrors the GEMM `_render_*` functions in `scripts/build_overview_slides.py`, which still draws these natively in the PPTX). Not collected (no `test_` prefix). Each `test_plot_*` skips if `gemm_sweep.json` is absent.

### D4. Tile sizes are data-driven; under-tile shapes are flagged

Tile sizes are read from `gemm_sweep.json` (`tile_sizes`), which the sweep records from `PeSchedulerComponent.TILE_M/K/N = 32/64/32` — the authoritative source. Shapes with `M