eval: fold GEMM/allreduce harnesses into self-contained milestone benches

Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/ into
two self-contained eval benches so a user can regenerate every result + figure
with one command:

    kernbench run --bench milestone-1h-gemm   (MILESTONE_FAST=1 reuses JSON)
    kernbench run --bench milestone-1h-ccl

- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
  run(torch) entry drives the sweeps and writes figures into
  benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
  sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
  re-export/wrapper shims over the benches (single source preserved); the
  pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
  per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).

ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated.

Verified: milestone benches run clean (ok=True, all artifacts), full suite
67 passed, lang-pairs OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:19:32 -07:00
parent e33e76f2d1
commit cc1bbd0ab7
19 changed files with 2189 additions and 1465 deletions
@@ -6,6 +6,9 @@
 # Auto-generated mesh file
 cube_mesh.yaml
 # Milestone bench output (regenerable: kernbench run --bench milestone-1h-*)
 src/kernbench/benches/1H_milestone_output/
 # Python
 __pycache__/
 *.py[cod]
@@ -371,6 +371,13 @@ Concrete forms that Part 1's *Verification Plan* MUST take in this repo:
 - `kernbench run --device <id>` runs the benchmark on a single device.
 - Omitting `--device` runs the benchmark on all devices discovered in the topology (logically parallel).
 - Device enumeration is handled by the CLI only; benchmarks MUST remain single-device.
 - **Eval-bench exception (ADR-0054)**: a *milestone / eval bench*
  (`milestone-1h-*`) may drive many configurations and build its own
  per-config engines to regenerate a domain's full result + figure set; it
  ignores `--device` and submits a sentinel tensor to satisfy the
  "must submit ≥1 request" contract (ADR-0045 D4). This is the eval-harness
  carve-out to the single-device rule, alongside the ADR-0024 multi-SIP CCL
  exception.
 ## Derived Artifacts (Clarification)
@@ -7,6 +7,11 @@ Accepted
 Documents the `tests/sccl/` evaluation harness; verified against the
 implementation (constants, file set, and sweep dimensions cross-checked).
 **Amended by ADR-0054**: the driver core, sweeps, and renderers moved into
 the `milestone-1h-ccl` bench (single home); `tests/sccl/_allreduce_helpers.py`
 now re-exports from it (the pytest-only param builders and the
 `_run_distributed` wrapper stay local). The figure tests are unchanged.
 ## Context
 ADR-0032 defines the intercube all-reduce *algorithm*, and ADR-0023/0024/0027
@@ -8,6 +8,12 @@ GEMM 평가/특성화 하니스를 문서화한다; 구현과 대조 검증 완
 (constants, tile sizes, figure set, and the script↔test split cross-checked).
 The D5/D6 caveats are recorded limitations, not inaccuracies.
 **Amended by ADR-0054**: the sweep + renderers moved into the
 `milestone-1h-gemm` bench (single home); `scripts/gemm_sweep.py` and
 `tests/gemm/` now re-export from it. D1/D2's "data generation stays a manual
 script / heavy work is opt-in" is superseded by the eval-bench pattern (one
 bench regenerates everything; `MILESTONE_FAST=1` reuses the committed JSON).
 ## Context
 ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generator) define the GEMM
 *implementation*
@@ -10,6 +10,10 @@ Accepted (2026-05-21).
 **how benches are registered and what function signature they must follow**
 had no ADR-level coverage.
 **Extended by ADR-0054**: D5's single-config rule gains a third pattern:
 the *eval bench* (e.g. `milestone-1h-*`) drives many configs, builds its
 own per-config engines, and submits a sentinel tensor to satisfy D4.
 ## First action
 When the `kernbench.benches` package is imported, `__init__.py` immediately
@@ -0,0 +1,137 @@
 # ADR-0054: Milestone Eval Benches (self-contained sweep + figure benches)
 ## Status
 Accepted (2026-05-22).
 Amends ADR-0044 (D1/D2) and ADR-0045 (D5), and supersedes the "logic lives
 in `scripts/` + `tests/`" arrangement of ADR-0043/0044: the GEMM and
 allreduce evaluation harnesses are now self-contained **benches** that a
 user runs to regenerate every result + figure.
 ## Context
 ADR-0043 (allreduce eval) and ADR-0044 (GEMM eval) split each harness into
 a **sweep** (a manual `scripts/` driver or, for allreduce, the parametrized
 tests themselves) plus **figure tests** that render committed data. The
 sweep/render logic therefore lived in `scripts/gemm_sweep.py`,
 `tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.
 The milestone requirement ("refactor allreduce + GEMM evaluation so a user
 can run *one bench* to generate all the results and plots") cannot be met
 by that layout: a bench is production code and **cannot import `tests/`**
 (ADR-0007 layer direction). The eval logic had to move into production so
 that a bench can reach it.
 The chosen home is the bench module itself, not a separate `kernbench.eval`
 package. A bench file may contain arbitrary module-level code, and
 collapsing the harness into the bench keeps one file per domain and removes
 one package layer.
 ## Decision
 ### D1. Two milestone benches own the eval logic
 - `src/kernbench/benches/milestone_1h_gemm.py`: the GEMM shape×variant sweep
  plus the three figure renderers (moved from `scripts/gemm_sweep.py` and
  `tests/gemm/_gemm_plot_helpers.py`).
 - `src/kernbench/benches/milestone_1h_ccl.py`: the distributed allreduce
  driver, latency + buffer-kind sweeps, topology diagram, FSIM comparison,
  and the direct-launch parity reference (moved from
  `tests/sccl/_allreduce_helpers.py`).
 Each file is the **single home** for its domain's eval logic.
 ### D2. The "eval bench" pattern (extends ADR-0045 D5)
 ADR-0045 D5 fixed a bench to a single configuration (single-SIP, or the
 ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:
 - An **eval bench** may drive *many* configurations and render figures. It
  builds its own `GraphEngine` / `RuntimeContext` at every sweep point
  instead of using the outer `run_bench` engine.
 - The outer ctx then has no submitted handles, so the bench submits a
  **sentinel tensor** (`torch.zeros((1, 1), …)`) at the end to satisfy
  `run_bench`'s "submit at least once" contract (ADR-0045 D4), so that the
  CLI exits 0.
 ### D3. Output location
 Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
 (per user request: artifacts beside the bench). The directory holds only
 generated PNG/CSV/JSON (no `.py`/`__init__.py`), so the eager-import audit
 (ADR-0045 first action) ignores it: `pkgutil.iter_modules` does not yield
 non-package subdirectories. Unlike the committed `docs/diagrams/` artifacts,
 it is **git-ignored** (regenerable on demand).
 ### D4. GEMM heavy sweep: fresh by default, `MILESTONE_FAST` to reuse
 `milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one
 shape is 2048 tiles). `MILESTONE_FAST=1` reuses the committed
 `docs/diagrams/gemm_sweep.json` and only renders (seconds). This reverses
 ADR-0044 D1/D2's "heavy sweep stays a manual/`slow` step": running the
 bench is the regeneration. In the test suite, the slow path is exercised by
 a `@pytest.mark.slow` bench test, and the fast path runs by default.
 ### D5. Tests + script reuse via thin re-export shims (single home kept)
 The existing figure tests and the `scripts/gemm_sweep.py` entry point are
 retained and now reuse the bench modules:
 - `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
  `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
  `kernbench.benches.milestone_1h_gemm`.
 - `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
  writers, sweep constants, renderers, and disk aggregators from
  `kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
  local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
  `_bk_params`) and the fixture-coupled `_run_distributed`
  (`monkeypatch.chdir` + `_drive_distributed`) wrapper.
 - `scripts/gemm_sweep.py` → thin wrapper over the bench's `run_sweep`.
 Tests may import a bench module (tests sit above production, ADR-0007);
 doing so triggers the whole-package eager audit, which already runs on every
 `kernbench` invocation. matplotlib stays lazily imported inside the
 renderers, so the audit's startup cost is unchanged.
 ### D6. Flat module naming (no `benches/` subfolder)
 A `benches/` subpackage named `1H_milestone…` is impossible: a Python
 package name cannot start with a digit. The benches are therefore flat
 modules `milestone_1h_gemm.py` / `milestone_1h_ccl.py`, and the bench names
 are `milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, letter-first per
 ADR-0045 D1).
 ## Consequences
 ### Positive
 - `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of
  a domain's results + figures in one command, which is the milestone
  requirement.
 - Single source for the eval logic (the bench), reused by tests and the
  script via shims; no duplication.
 - The figure tests and `scripts/gemm_sweep.py` keep working unchanged.
 ### Negative / limitations
 - The two bench files are large (the CCL one mixes the distributed driver,
  sweeps, and matplotlib drawing). A "bench" that is mostly an eval harness
  is unusual; this ADR justifies it.
 - Generated artifacts live inside the source tree (`src/kernbench/benches/`)
  by explicit request; they are git-ignored to avoid committing them.
 - `milestone-1h-ccl` (and the default `milestone-1h-gemm`) take minutes:
  acceptable for an on-demand milestone artifact, not for routine runs.
 ## Dependencies
 - **ADR-0007**: layer direction (why tests may import production but a bench
  may not import tests).
 - **ADR-0043 / ADR-0044**: the allreduce / GEMM eval harnesses this ADR
  relocates into benches.
 - **ADR-0045**: bench module contract; D2 here extends its D5 (single-device
  rule) with the eval-bench pattern and relies on D4 (NO_REQUESTS) for the
  sentinel.
 - **ADR-0024**: the rank = SIP launcher driven by the allreduce sweeps.
 ## Open questions
 - Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from
  ADR-0033/0014 rather than copied? Unchanged by this ADR.
 - Should `build_overview_slides.py` consume the milestone output PNGs
  instead of drawing GEMM bars natively? Still open (ADR-0044 D6 / Negative).
@@ -1,6 +1,6 @@
 # ADR Index
-Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **46**.
+Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **47**.
 Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.
@@ -152,6 +152,7 @@ One subsection per component file under `src/kernbench/components/builtin/`.
 - [ADR-0043](./ADR-0043-eval-allreduce-harness.md) - Allreduce Evaluation Harness - `tests/sccl/`
 - [ADR-0044](./ADR-0044-eval-gemm-harness.md) - GEMM Evaluation Harness - `scripts/gemm_sweep.py` + `tests/gemm/`
 - [ADR-0054](./ADR-0054-eval-milestone-benches.md) - Milestone Eval Benches - self-contained sweep + figure benches
 ### Bench Module Contract
@@ -7,6 +7,11 @@ Accepted
 Documents the `tests/sccl/` evaluation harness; verified against the
 implementation (constants, file set, and sweep dimensions cross-checked).
 **Amended by ADR-0054**: the driver core, sweeps, and renderers moved into
 the `milestone-1h-ccl` bench (single home); `tests/sccl/_allreduce_helpers.py`
 now re-exports from it (keeping the pytest-only param builders +
 `_run_distributed` wrapper local). The figure tests are unchanged.
 ## Context
 ADR-0032 defines the intercube all-reduce *algorithm*; ADR-0023/0024/0027
@@ -9,6 +9,12 @@ implementation (constants, tile sizes, figure set, and the script↔test
 split cross-checked). The D5/D6 caveats are recorded limitations, not
 inaccuracies.
 **Amended by ADR-0054**: the sweep + renderers moved into the
 `milestone-1h-gemm` bench (single home); `scripts/gemm_sweep.py` and
 `tests/gemm/` now re-export from it. D1/D2's "data generation stays a manual
 script / heavy work is opt-in" is superseded by the eval-bench pattern (one
 bench regenerates everything; `MILESTONE_FAST=1` reuses the committed JSON).
 ## Context
 ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generators) define the GEMM
@@ -10,6 +10,10 @@ module must follow. ADR-0010 (CLI surface) specifies the `kernbench
 list/run` interface, but **how benches are registered and what signature
 they must follow** had no ADR-level coverage.
 **Extended by ADR-0054**: D5's single-config rule gains a third pattern —
 the *eval bench* (e.g. `milestone-1h-*`) drives many configs, builds its
 own per-config engines, and submits a sentinel tensor to satisfy D4.
 ## First action
 When `kernbench.benches` is imported, `__init__.py` immediately calls
@@ -0,0 +1,141 @@
 # ADR-0054: Milestone Eval Benches — self-contained sweep + figure benches
 ## Status
 Accepted (2026-05-22).
 Amends ADR-0044 (D1/D2) and ADR-0045 (D5) and supersedes the "logic lives
 in `scripts/` + `tests/`" arrangement of ADR-0043/0044: the GEMM and
 allreduce evaluation harnesses are now self-contained **benches** that a
 user runs to regenerate every result + figure.
 ## Context
 ADR-0043 (allreduce eval) and ADR-0044 (GEMM eval) split each harness into
 a **sweep** (a manual `scripts/` driver, or — for allreduce — the
 parametrized tests themselves) plus **figure tests** that render committed
 data. The sweep/render logic therefore lived under `scripts/gemm_sweep.py`,
 `tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.
 A milestone requirement ("refactor allreduce + GEMM evaluation so a user
 can run *one bench* to generate all the results and plots") cannot be met
 by that layout: a bench is production code and **must not import from
 `tests/`** (ADR-0007 layer direction). The eval logic had to move into
 production, reachable from a bench.
 The chosen home is the bench module itself — not a separate
 `kernbench.eval` package. A bench file may contain arbitrary module-level
 code; collapsing the harness into the bench keeps one file per domain and
 avoids an extra package layer.
 ## Decision
 ### D1. Two milestone benches own the eval logic
 - `src/kernbench/benches/milestone_1h_gemm.py` — GEMM shape×variant sweep +
  the three figure renderers (moved from `scripts/gemm_sweep.py` +
  `tests/gemm/_gemm_plot_helpers.py`).
 - `src/kernbench/benches/milestone_1h_ccl.py` — the distributed allreduce
  driver, latency + buffer-kind sweeps, topology diagram, FSIM comparison,
  and the direct-launch parity reference (moved from
  `tests/sccl/_allreduce_helpers.py`).
 Each file is the **single home** for its domain's eval logic.
 ### D2. The "eval bench" pattern (extends ADR-0045 D5)
 ADR-0045 D5 fixed a bench to a single configuration (single-SIP, or the
 ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:
 - An **eval bench** may drive *many* configurations and render figures. It
  builds its own per-config `GraphEngine` / `RuntimeContext` instances
  (one per sweep point) rather than using the outer `run_bench` engine.
 - Because the outer ctx then has no submitted handles, the bench submits a
  **sentinel tensor** (`torch.zeros((1, 1), …)`) at the end to satisfy
  `run_bench`'s "must submit at least one request" contract (ADR-0045 D4),
  so the CLI exits 0.
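
The D2 pattern can be sketched in plain Python. This is a minimal sketch under stated assumptions: `FakeOuterCtx`, `FakeEngine`, and `run_eval_bench` are hypothetical stand-ins for the outer `run_bench` context, a per-config `GraphEngine` / `RuntimeContext` pair, and the bench's `run(torch)` entry. None of them is the real kernbench API, and a nested list stands in for `torch.zeros((1, 1), …)`.

```python
from dataclasses import dataclass, field


@dataclass
class FakeOuterCtx:
    """Hypothetical stand-in for the outer run_bench context."""
    submitted: list = field(default_factory=list)

    def submit(self, tensor) -> None:
        self.submitted.append(tensor)


class FakeEngine:
    """Hypothetical stand-in for an engine the bench builds per sweep point."""

    def __init__(self, config: dict) -> None:
        self.config = config

    def simulate(self) -> dict:
        # A real engine would run a simulation; this one only echoes the config.
        return {"config": self.config, "ok": True}


def run_eval_bench(outer_ctx: FakeOuterCtx, configs: list) -> list:
    """Drive many configs on private engines, then submit one sentinel."""
    results = []
    for cfg in configs:
        engine = FakeEngine(cfg)      # one engine per sweep point
        results.append(engine.simulate())
    # Nothing above touched outer_ctx, so submit a 1x1 placeholder to meet
    # the "at least one request" contract described in D2.
    outer_ctx.submit([[0.0]])
    return results


ctx = FakeOuterCtx()
results = run_eval_bench(ctx, [{"shape": "32x64x32"}, {"shape": "512"}])
print(len(results), len(ctx.submitted))
```

The point of the sketch is the asymmetry: the sweep produces one result per configuration, while the outer context sees exactly one submission regardless of sweep size.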
 ### D3. Output location
 Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
 (per user request — artifacts beside the bench). The directory holds only
 generated PNG/CSV/JSON (never a `.py`/`__init__.py`), so the eager-import
 audit (ADR-0045 first action) ignores it — `pkgutil.iter_modules` does not
 yield non-package subdirectories. It is **git-ignored** (regenerable on
 demand), unlike the committed `docs/diagrams/` artifacts.
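
The discovery claim in D3 can be checked with the standard library alone. The sketch below builds a throwaway directory that mimics the layout (the file names are illustrative, and the real audit code is not shown in this ADR) and lists what `pkgutil.iter_modules` yields for it.

```python
import pkgutil
import tempfile
from pathlib import Path


def discoverable_names(package_dir: Path) -> set:
    """Names that pkgutil.iter_modules yields for a single directory."""
    return {info.name for info in pkgutil.iter_modules([str(package_dir)])}


with tempfile.TemporaryDirectory() as tmp:
    benches = Path(tmp) / "benches"
    benches.mkdir()
    (benches / "__init__.py").write_text("")
    (benches / "milestone_1h_gemm.py").write_text("")

    # Output directory with generated artifacts only: no __init__.py inside.
    out = benches / "1H_milestone_output" / "gemm"
    out.mkdir(parents=True)
    (out / "figure.png").write_bytes(b"")

    names = discoverable_names(benches)

print(sorted(names))  # ['milestone_1h_gemm']
```

Only the module is reported. The output directory is skipped because it contains no `__init__.py`, which is the property D3 relies on.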
 ### D4. GEMM heavy sweep — fresh by default, `MILESTONE_FAST` to reuse
 `milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one
 shape is 2048 tiles). `MILESTONE_FAST=1` reuses the committed
 `docs/diagrams/gemm_sweep.json` and only re-renders (seconds). This
 reverses ADR-0044 D1/D2's "heavy sweep stays a manual/`slow`-marked step":
 running the bench *is* the regeneration. In the test suite, the full sweep
 is covered by a `@pytest.mark.slow` bench test, while the `MILESTONE_FAST`
 path is covered by a test that runs by default.
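
The figures quoted in D4 follow from the sweep constants in the bench module. The sketch below re-derives them and shows one way the fast-path switch could be read. `parse_shape`, `tile_count`, and `reuse_committed_json` are hypothetical helper names, and treating only the exact value `"1"` as "fast" is an assumption; the bench's actual parsing of `MILESTONE_FAST` is not shown in this ADR.

```python
import os

# Constants copied from the bench module in this commit.
TILE_M, TILE_K, TILE_N = 32, 64, 32
DEFAULT_SHAPES = [
    "32x32x32", "32x64x32", "32x128x32", "32x128x128",
    "32x3072x32", "8x128x128", "128x8x128", "512",
]
VARIANTS = ["ref_ref", "load_ref", "load_load"]


def parse_shape(spec: str) -> tuple:
    """'MxKxN' -> (M, K, N); a single integer means a square M=K=N."""
    spec = spec.strip().lower()
    if "x" in spec:
        m, k, n = (int(part) for part in spec.split("x"))
        return m, k, n
    side = int(spec)
    return side, side, side


def tile_count(m: int, k: int, n: int) -> int:
    """Expected tiles: ceil(M/TILE_M) * ceil(N/TILE_N) * ceil(K/TILE_K)."""
    def ceil_div(a: int, b: int) -> int:
        return (a + b - 1) // b
    return ceil_div(m, TILE_M) * ceil_div(n, TILE_N) * ceil_div(k, TILE_K)


def reuse_committed_json(environ=None) -> bool:
    """Assumption: only MILESTONE_FAST=1 selects the render-only path."""
    env = os.environ if environ is None else environ
    return env.get("MILESTONE_FAST") == "1"


sims = len(DEFAULT_SHAPES) * len(VARIANTS)
largest = max(tile_count(*parse_shape(s)) for s in DEFAULT_SHAPES)
print(sims, largest)  # 24 2048
```

Eight shapes times three variants gives the 24 simulations, and the square `512` shape accounts for the 2048-tile case (16 × 16 × 8).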
 ### D5. Tests + script reuse via thin re-export shims (single home kept)
 The pre-existing figure tests and the `scripts/gemm_sweep.py` entry point
 are retained and now reuse the bench modules:
 - `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
  `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
  `kernbench.benches.milestone_1h_gemm`.
 - `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
  writers, sweep constants, renderers, and disk aggregators from
  `kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
  local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
  `_bk_params`) and the fixture-coupled `_run_distributed`
  (`monkeypatch.chdir` + `_drive_distributed`) wrapper.
 - `scripts/gemm_sweep.py` → thin wrapper over the bench's `run_sweep`.
 Tests importing a bench module is permitted (tests sit above production,
 ADR-0007); it triggers the whole-package eager audit, which already runs on
 every `kernbench` invocation. matplotlib stays lazily imported inside the
 renderers, so the audit's startup cost is unchanged.
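
The shim arrangement in D5 comes down to ordinary Python re-exports. The sketch below imitates it with two in-memory modules so that it runs without kernbench installed; `fake_bench_home` and `fake_plot_helpers` are hypothetical stand-ins for the bench module and a test helper shim, not real module names.

```python
import sys
import types

# Hypothetical "single home": stands in for the bench module.
home = types.ModuleType("fake_bench_home")


def _run_sweep():
    return "sweep executed by the single home"


home.run_sweep = _run_sweep
sys.modules["fake_bench_home"] = home

# Hypothetical shim: its entire body is one re-export statement.
shim = types.ModuleType("fake_plot_helpers")
exec("from fake_bench_home import run_sweep  # re-export only", shim.__dict__)
sys.modules["fake_plot_helpers"] = shim

# Existing callers keep importing from the shim, as before the move.
from fake_plot_helpers import run_sweep

same_object = run_sweep is home.run_sweep
print(same_object)  # True
```

Because the shim binds the same function object rather than a copy, callers of the old import path always execute the code that lives in the home module.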
 ### D6. Flat module naming (no `benches/` subfolder)
 A `benches/` subpackage named `1H_milestone…` is impossible — a Python
 package name cannot start with a digit. The benches are therefore flat
 modules `milestone_1h_gemm.py` / `milestone_1h_ccl.py` with bench names
 `milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, letter-first per
 ADR-0045 D1).
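
The naming constraint in D6 can be verified with `str.isidentifier`. In this sketch, `bench_name_to_module` is a hypothetical mapping from a kebab-case bench name to a module name; the registry's real mapping is not shown in this ADR.

```python
import keyword


def is_valid_module_name(name: str) -> bool:
    """True if `name` can be used in an `import` statement."""
    return name.isidentifier() and not keyword.iskeyword(name)


def bench_name_to_module(bench_name: str) -> str:
    """Hypothetical mapping: kebab-case bench name -> flat module name."""
    return bench_name.replace("-", "_")


candidates = ("1H_milestone", "milestone_1h_gemm",
              bench_name_to_module("milestone-1h-ccl"))
for candidate in candidates:
    print(candidate, is_valid_module_name(candidate))
```

The digit-first name is rejected, while both flat module names pass, since a digit is allowed anywhere after the first character.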
 ## Consequences
 ### Positive
 - `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of
  a domain's results + figures in one command — the milestone requirement.
 - Single source for the eval logic (the bench), reused by tests and the
  script via shims; no duplication.
 - The figure tests and `scripts/gemm_sweep.py` keep working unchanged.
 ### Negative / limitations
 - The two bench files are large (the CCL one mixes the distributed driver,
  sweeps, and matplotlib drawing). A "bench" that is mostly an eval harness
  is unusual; this ADR legitimizes it.
 - Generated artifacts live inside the source tree (`src/kernbench/benches/`)
  by explicit request; git-ignored to avoid committing them.
 - `milestone-1h-ccl` (and the default `milestone-1h-gemm`) take minutes —
  acceptable for an on-demand milestone artifact, not for routine runs.
 ## Dependencies
 - **ADR-0007**: layer direction (why tests may import production but a bench
  may not import tests).
 - **ADR-0043 / ADR-0044**: the allreduce / GEMM eval harnesses this ADR
  relocates into benches.
 - **ADR-0045**: bench module contract; D2 here extends its D5 (single-device
  rule) with the eval-bench pattern, and relies on D4 (NO_REQUESTS) for the
  sentinel.
 - **ADR-0024**: rank = SIP launcher driven by the allreduce sweeps.
 ## Open questions
 - Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from
  ADR-0033/0014 rather than copied? Unchanged by this ADR.
 - Should `build_overview_slides.py` consume the milestone output PNGs
  instead of drawing GEMM bars natively? Still open (ADR-0044 D6 / Negative).
@@ -1,6 +1,6 @@
 # ADR Index
-Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **46**.
+Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **47**.
 Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.
@@ -152,6 +152,7 @@ One subsection per component file under `src/kernbench/components/builtin/`.
 - [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce Evaluation Harness — `tests/sccl/`
 - [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`
 - [ADR-0054](./ADR-0054-eval-milestone-benches.md) — Milestone Eval Benches — self-contained sweep + figure benches
 ### Bench Module Contract
@@ -1,237 +1,20 @@
 """Sweep GEMM shapes through kernbench and dump PE_accelerator engine times.
-For each shape:
+Thin wrapper: the sweep logic now lives in
-  - run benches.matmul_composite via the same run_bench path the CLI uses
+``kernbench.benches.milestone_1h_gemm`` (the single home, ADR-0054, also the
-  - read result.engine.op_log
+``milestone-1h-gemm`` bench). This script remains the manual entry point for
-  - filter to per-PE engines: pe_dma, pe_fetch_store, pe_gemm, pe_math
+regenerating ``docs/diagrams/gemm_sweep.json`` on demand and honors the same
-  - record sum-of-durations (engine occupancy) AND wall-clock active interval
+``SWEEP_SHAPES`` / ``SWEEP_TOPOLOGY`` env overrides.
-Output: docs/diagrams/gemm_sweep.json
+    python scripts/gemm_sweep.py
 """
 from __future__ import annotations
-import json
+from kernbench.benches.milestone_1h_gemm import run_sweep
 import os
 import sys
 import time
 from pathlib import Path
 # Default sweep covering under-tile, single-tile, multi-tile, and asymmetric regimes.
 # Each entry is either a single integer (square M=K=N=S) or "MxKxN".
 # Override via env: SWEEP_SHAPES="16,32,16x2048x16,..."
 DEFAULT_SHAPES = [
    "32x32x32",       # 1 tile, K=32 < TILE_K=64 → under-tile in K
    "32x64x32",       # 1 tile, exact single-tile fit
    "32x128x32",      # 2 tiles, aligned
    "32x128x128",     # 8 tiles, aligned
    "32x3072x32",     # 48 tiles, all K-axis (tall-skinny)
    "8x128x128",      # 8 tiles, but M=8 < TILE_M=32 → MAC array under-fed
    "128x8x128",      # 16 tiles, but K=8 < TILE_K=64 → MAC array under-fed
    "512",            # 2048 tiles, fully aligned — "well-pipelined" reference
 ]
 # Operand-staging variants exercised per shape.
 VARIANTS = ["ref_ref", "load_ref", "load_load"]
 # Engines whose timings we collect (component_id suffix match).
 ENGINES = ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_math"]
 # Per-stage breakdown labels (StageType enum names from pe_types.py).
 STAGES = ["DMA_READ", "DMA_WRITE", "FETCH", "STORE", "GEMM", "MATH"]
 # Scheduler tile sizes (mirror of PeSchedulerComponent.TILE_M/K/N).
 TILE_M, TILE_K, TILE_N = 32, 64, 32
 OUT_PATH = Path(__file__).parent.parent / "docs" / "diagrams" / "gemm_sweep.json"
 def _engine_wall_ns(records, suffix: str) -> float:
    """Wall-clock interval the engine was active (union of overlapping ops)."""
    intervals = [(r.t_start, r.t_end) for r in records
                 if r.component_id.endswith("." + suffix)]
    if not intervals:
        return 0.0
    intervals.sort()
    merged_end = intervals[0][1]
    merged_start = intervals[0][0]
    total = 0.0
    for s, e in intervals[1:]:
        if s <= merged_end:
            merged_end = max(merged_end, e)
        else:
            total += merged_end - merged_start
            merged_start, merged_end = s, e
    total += merged_end - merged_start
    return total
 def _engine_occupancy_ns(records, suffix: str) -> float:
    return sum(r.t_end - r.t_start for r in records
               if r.component_id.endswith("." + suffix))
 def _engine_count(records, suffix: str) -> int:
    return sum(1 for r in records if r.component_id.endswith("." + suffix))
 def _stage_occupancy_ns(records, stage_type: str) -> float:
    """Sum t_end - t_start over op_log records whose params.stage_type matches.
    Requires op_log records produced post the TileToken stage_type capture
    (sim_engine/op_log.py).
    """
    return sum(
        r.t_end - r.t_start
        for r in records
        if r.params.get("stage_type") == stage_type
    )
 def _stage_wall_ns(records, stage_type: str) -> float:
    """Interval-union wall-clock for records whose stage_type matches."""
    intervals = sorted(
        (r.t_start, r.t_end) for r in records
        if r.params.get("stage_type") == stage_type
    )
    if not intervals:
        return 0.0
    total = 0.0
    cs, ce = intervals[0]
    for s, e in intervals[1:]:
        if s <= ce:
            ce = max(ce, e)
        else:
            total += ce - cs
            cs, ce = s, e
    total += ce - cs
    return total
 def _stage_count(records, stage_type: str) -> int:
    return sum(1 for r in records if r.params.get("stage_type") == stage_type)
 def _run_one(M: int, K: int, N: int, topology: str, variant: str = "ref_ref") -> dict:
    os.environ["MATMUL_M"] = str(M)
    os.environ["MATMUL_K"] = str(K)
    os.environ["MATMUL_N"] = str(N)
    os.environ["MATMUL_VARIANT"] = variant
    # Late imports so env vars are read by matmul_composite at module load.
    # Force re-import to pick up new env values.
    for mod_name in [m for m in list(sys.modules) if m.startswith("kernbench.benches.matmul_composite")]:
        del sys.modules[mod_name]
    from kernbench.benches.registry import resolve as resolve_bench
    from kernbench.runtime_api.bench_runner import run_bench
    from kernbench.runtime_api.types import resolve_device
    from kernbench.sim_engine.engine import GraphEngine
    from kernbench.topology.builder import resolve_topology
    topo = resolve_topology(topology)
    bench = resolve_bench("matmul-composite").run
    device = resolve_device(None)
    t0 = time.time()
    result = run_bench(
        topology=topo, bench_fn=bench, device=device,
        engine_factory=lambda t, d: GraphEngine(
            getattr(t, "topology_obj", t), enable_data=True,
        ),
    )
    wall = time.time() - t0
    op_log = result.engine.op_log
    if not result.completion.ok:
        raise RuntimeError(f"bench failed at M={M},K={K},N={N}: {result.completion}")
    # Bytes touched at f16 (2 B): full A + full B + full out (each operand
    # streamed once through HBM by the composite plan).
    bytes_total = (M * K + K * N + M * N) * 2
    row = {
        "M": M, "K": K, "N": N,
        "variant": variant,
        "flops": 2 * M * K * N,
        "bytes_hbm": bytes_total,
        "arith_intensity": (2 * M * K * N) / bytes_total,  # flops/byte
        "tile_count_expected": _ceil(M, TILE_M) * _ceil(N, TILE_N) * _ceil(K, TILE_K),
        "sim_wall_clock_s": round(wall, 3),
        "engines": {},
    }
    for eng in ENGINES:
        row["engines"][eng] = {
            "occupancy_ns": _engine_occupancy_ns(op_log, eng),
            "wall_ns":      _engine_wall_ns(op_log, eng),
            "record_count": _engine_count(op_log, eng),
        }
    row["stages"] = {}
    for stage in STAGES:
        row["stages"][stage] = {
            "occupancy_ns": _stage_occupancy_ns(op_log, stage),
            "wall_ns":      _stage_wall_ns(op_log, stage),
            "record_count": _stage_count(op_log, stage),
        }
    # Kernel-window wall-clock = max t_end - min t_start over PE engine records.
    pe_records = [r for r in op_log
                  if any(r.component_id.endswith("." + e) for e in ENGINES)]
    if pe_records:
        row["pe_window_ns"] = max(r.t_end for r in pe_records) \
                              - min(r.t_start for r in pe_records)
    else:
        row["pe_window_ns"] = 0.0
    stage_records = [r for r in op_log
                     if r.params.get("stage_type") in STAGES]
    if stage_records:
        row["composite_window_ns"] = max(r.t_end for r in stage_records) \
                                     - min(r.t_start for r in stage_records)
    else:
        row["composite_window_ns"] = 0.0
    return row
 def _ceil(a: int, b: int) -> int:
    return (a + b - 1) // b
 def main() -> int:
-    shapes_env = os.environ.get("SWEEP_SHAPES")
+    run_sweep()
    raw = (shapes_env.split(",") if shapes_env else DEFAULT_SHAPES)
    shapes: list[tuple[int, int, int]] = []
    for s in raw:
        s = s.strip()
        if not s:
            continue
        if "x" in s.lower():
            parts = s.lower().split("x")
            shapes.append((int(parts[0]), int(parts[1]), int(parts[2])))
        else:
            v = int(s)
            shapes.append((v, v, v))
    topology = os.environ.get("SWEEP_TOPOLOGY", "topology.yaml")
    rows = []
    for M, K, N in shapes:
        for variant in VARIANTS:
            print(f"[sweep] M={M} K={K} N={N} variant={variant} ...", flush=True)
            row = _run_one(M, K, N, topology, variant=variant)
            rows.append(row)
            eng_dma = row["engines"]["pe_dma"]
            eng_gem = row["engines"]["pe_gemm"]
            print(f"   tiles={row['tile_count_expected']:>6}  "
                  f"pe_window={row['pe_window_ns']:8.1f}ns  "
                  f"dma_occ={eng_dma['occupancy_ns']:9.1f}  "
                  f"gemm_occ={eng_gem['occupancy_ns']:8.1f}  "
                  f"(sim {row['sim_wall_clock_s']:.1f}s)")
    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    OUT_PATH.write_text(json.dumps({
        "tile_sizes": {"M": TILE_M, "K": TILE_K, "N": TILE_N},
        "engines": ENGINES,
        "stages": STAGES,
        "variants": VARIANTS,
        "rows": rows,
    }, indent=2))
    print(f"\n[sweep] wrote {OUT_PATH}")
    return 0
@@ -0,0 +1,568 @@
 """milestone-1h-gemm bench: GEMM evaluation harness (sweep + figures).
 Self-contained milestone bench (ADR-0054). Holds the shape×variant sweep
 and the figure renderers; the ``run(torch)`` entry at the bottom runs the
 sweep (or reuses the committed JSON when ``MILESTONE_FAST=1``) and writes
 every figure into ``benches/1H_milestone_output/gemm/``.
 This is the single home for the GEMM eval logic: the figure tests import a
 thin re-export shim (``tests/gemm/_gemm_plot_helpers.py``), as does the
 ``scripts/gemm_sweep.py`` wrapper.
 The sweep drives ``matmul-composite`` across shapes×variants through the
 same ``run_bench`` path the CLI uses, harvests ``result.engine.op_log``,
 and writes the sweep JSON. The renderers read that JSON and emit matplotlib
 PNGs. No simulation in the renderers — they are fast.
 Chart set (mirrors the GEMM MAC slides in scripts/build_overview_slides.py):
  - stage breakdown (load_ref operand staging)
  - MAC utilization — measured (load_ref)
  - MAC utilization — theoretical vs measured (load_ref)
 """
 from __future__ import annotations
 import json
 import os
 import sys
 import time
 from pathlib import Path
 from kernbench.benches.registry import bench
 from kernbench.policy.placement.dp import DPPolicy
 ROOT = Path(__file__).resolve().parents[3]
 DEFAULT_SWEEP_JSON = ROOT / "docs" / "diagrams" / "gemm_sweep.json"
 DEFAULT_PLOTS_DIR = ROOT / "docs" / "diagrams" / "gemm_plots"
 _OUTPUT_DIR = Path(__file__).resolve().parent / "1H_milestone_output" / "gemm"
 # ── sweep configuration ────────────────────────────────────────────────
 # Default sweep covering under-tile, single-tile, multi-tile, and asymmetric
 # regimes. Each entry is "MxKxN" or a single int (square M=K=N).
 # Override via env: SWEEP_SHAPES="16,32,16x2048x16,..."
 DEFAULT_SHAPES = [
    "32x32x32",       # 1 tile, K=32 < TILE_K=64 → under-tile in K
    "32x64x32",       # 1 tile, exact single-tile fit
    "32x128x32",      # 2 tiles, aligned
    "32x128x128",     # 8 tiles, aligned
    "32x3072x32",     # 48 tiles, all K-axis (tall-skinny)
    "8x128x128",      # 8 tiles, but M=8 < TILE_M=32 → MAC array under-fed
    "128x8x128",      # 16 tiles, but K=8 < TILE_K=64 → MAC array under-fed
    "512",            # 2048 tiles, fully aligned — "well-pipelined" reference
 ]
 # Operand-staging variants exercised per shape.
 VARIANTS = ["ref_ref", "load_ref", "load_load"]
 # Engines whose timings we collect (component_id suffix match).
 ENGINES = ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_math"]
 # Per-stage breakdown labels (StageType enum names from pe_types.py).
 STAGES = ["DMA_READ", "DMA_WRITE", "FETCH", "STORE", "GEMM", "MATH"]
 # Scheduler tile sizes (mirror of PeSchedulerComponent.TILE_M/K/N).
 TILE_M, TILE_K, TILE_N = 32, 64, 32
 def _ceil(a: int, b: int) -> int:
    """Ceiling division for non-negative ``a`` and positive ``b``."""
    return (a + b - 1) // b
 def _engine_wall_ns(records, suffix: str) -> float:
    """Wall-clock interval the engine was active (union of overlapping ops)."""
    intervals = [(r.t_start, r.t_end) for r in records
                 if r.component_id.endswith("." + suffix)]
    if not intervals:
        return 0.0
    intervals.sort()
    merged_end = intervals[0][1]
    merged_start = intervals[0][0]
    total = 0.0
    for s, e in intervals[1:]:
        if s <= merged_end:
            merged_end = max(merged_end, e)
        else:
            total += merged_end - merged_start
            merged_start, merged_end = s, e
    total += merged_end - merged_start
    return total
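 def _engine_occupancy_ns(records, suffix: str) -> float:
    """Sum of op durations on the engine (overlapping ops each count in full)."""
    return sum(r.t_end - r.t_start for r in records
               if r.component_id.endswith("." + suffix))
 def _engine_count(records, suffix: str) -> int:
    """Number of op_log records whose component_id ends with the engine suffix."""
    return sum(1 for r in records if r.component_id.endswith("." + suffix))
 def _stage_occupancy_ns(records, stage_type: str) -> float:
    """Sum t_end - t_start over op_log records whose params.stage_type matches."""
    return sum(
        r.t_end - r.t_start
        for r in records
        if r.params.get("stage_type") == stage_type
    )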
 def _stage_wall_ns(records, stage_type: str) -> float:
    """Interval-union wall-clock for records whose stage_type matches."""
    intervals = sorted(
        (r.t_start, r.t_end) for r in records
        if r.params.get("stage_type") == stage_type
    )
    if not intervals:
        return 0.0
    total = 0.0
    cs, ce = intervals[0]
    for s, e in intervals[1:]:
        if s <= ce:
            ce = max(ce, e)
        else:
            total += ce - cs
            cs, ce = s, e
    total += ce - cs
    return total
 def _stage_count(records, stage_type: str) -> int:
    """Number of op_log records whose params.stage_type matches."""
    return sum(1 for r in records if r.params.get("stage_type") == stage_type)
 def _run_one(M: int, K: int, N: int, topology: str, variant: str = "ref_ref") -> dict:
    os.environ["MATMUL_M"] = str(M)
    os.environ["MATMUL_K"] = str(K)
    os.environ["MATMUL_N"] = str(N)
    os.environ["MATMUL_VARIANT"] = variant
    # Late imports so env vars are read by matmul_composite at module load.
    # Force re-import to pick up new env values.
    for mod_name in [m for m in list(sys.modules)
                     if m.startswith("kernbench.benches.matmul_composite")]:
        del sys.modules[mod_name]
    from kernbench.benches.registry import resolve as resolve_bench
    from kernbench.runtime_api.bench_runner import run_bench
    from kernbench.runtime_api.types import resolve_device
    from kernbench.sim_engine.engine import GraphEngine
    from kernbench.topology.builder import resolve_topology
    topo = resolve_topology(topology)
    bench = resolve_bench("matmul-composite").run
    device = resolve_device(None)
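    # perf_counter() is monotonic, so the elapsed time cannot be skewed by
    # system clock adjustments the way time.time() can.
    t0 = time.perf_counter()
    result = run_bench(
        topology=topo, bench_fn=bench, device=device,
        engine_factory=lambda t, d: GraphEngine(
            getattr(t, "topology_obj", t), enable_data=True,
        ),
    )
    wall = time.perf_counter() - t0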
    op_log = result.engine.op_log
    if not result.completion.ok:
        raise RuntimeError(f"bench failed at M={M},K={K},N={N}: {result.completion}")
    # Bytes touched at f16 (2 B): full A + full B + full out (each operand
    # streamed once through HBM by the composite plan).
    bytes_total = (M * K + K * N + M * N) * 2
    row = {
        "M": M, "K": K, "N": N,
        "variant": variant,
        "flops": 2 * M * K * N,
        "bytes_hbm": bytes_total,
        "arith_intensity": (2 * M * K * N) / bytes_total,  # flops/byte
        "tile_count_expected": _ceil(M, TILE_M) * _ceil(N, TILE_N) * _ceil(K, TILE_K),
        "sim_wall_clock_s": round(wall, 3),
        "engines": {},
    }
    for eng in ENGINES:
        row["engines"][eng] = {
            "occupancy_ns": _engine_occupancy_ns(op_log, eng),
            "wall_ns":      _engine_wall_ns(op_log, eng),
            "record_count": _engine_count(op_log, eng),
        }
    row["stages"] = {}
    for stage in STAGES:
        row["stages"][stage] = {
            "occupancy_ns": _stage_occupancy_ns(op_log, stage),
            "wall_ns":      _stage_wall_ns(op_log, stage),
            "record_count": _stage_count(op_log, stage),
        }
    # Kernel-window wall-clock = max t_end - min t_start over PE engine records.
    pe_records = [r for r in op_log
                  if any(r.component_id.endswith("." + e) for e in ENGINES)]
    if pe_records:
        row["pe_window_ns"] = max(r.t_end for r in pe_records) \
                              - min(r.t_start for r in pe_records)
    else:
        row["pe_window_ns"] = 0.0
    stage_records = [r for r in op_log
                     if r.params.get("stage_type") in STAGES]
    if stage_records:
        row["composite_window_ns"] = max(r.t_end for r in stage_records) \
                                     - min(r.t_start for r in stage_records)
    else:
        row["composite_window_ns"] = 0.0
    return row
 def _parse_shapes(raw) -> list[tuple[int, int, int]]:
    shapes: list[tuple[int, int, int]] = []
    for s in raw:
        s = s.strip()
        if not s:
            continue
        if "x" in s.lower():
            parts = s.lower().split("x")
            if len(parts) != 3:
                raise ValueError(f"expected MxKxN or a single integer, got {s!r}")
            shapes.append((int(parts[0]), int(parts[1]), int(parts[2])))
        else:
            v = int(s)
            shapes.append((v, v, v))
    return shapes
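The shape-spec grammar accepted above (and therefore by `SWEEP_SHAPES`) is either `MxKxN` or a single integer meaning a cube. A small standalone sketch, assuming a comma-separated spec string as in `run_sweep`; the function name `parse_shapes` and the sample spec are illustrative, and the sketch requires exactly three `x`-separated fields.

```python
def parse_shapes(raw):
    """Parse 'MxKxN' or a single integer (cube shape) into (M, K, N) tuples."""
    shapes = []
    for s in raw:
        s = s.strip().lower()
        if not s:
            continue  # tolerate empty entries such as a trailing comma
        if "x" in s:
            # Unpacking raises ValueError unless there are exactly 3 fields.
            m, k, n = (int(p) for p in s.split("x"))
            shapes.append((m, k, n))
        else:
            v = int(s)
            shapes.append((v, v, v))
    return shapes


# Illustrative spec: a cube, a rectangular shape, an empty entry, upper-case X.
spec = "64, 128x256x64, ,32X32X128"
print(parse_shapes(spec.split(",")))
# [(64, 64, 64), (128, 256, 64), (32, 32, 128)]
```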
 def run_sweep(out_json: Path | str = DEFAULT_SWEEP_JSON) -> Path:
    """Drive matmul-composite across shapes×variants; write the sweep JSON.
    Honors ``SWEEP_SHAPES`` / ``SWEEP_TOPOLOGY`` env overrides (same as the
    historical ``scripts/gemm_sweep.py``). Returns the JSON path written.
    """
    shapes_env = os.environ.get("SWEEP_SHAPES")
    raw = (shapes_env.split(",") if shapes_env else DEFAULT_SHAPES)
    shapes = _parse_shapes(raw)
    topology = os.environ.get("SWEEP_TOPOLOGY", "topology.yaml")
    rows = []
    for M, K, N in shapes:
        for variant in VARIANTS:
            print(f"[sweep] M={M} K={K} N={N} variant={variant} ...", flush=True)
            row = _run_one(M, K, N, topology, variant=variant)
            rows.append(row)
            eng_dma = row["engines"]["pe_dma"]
            eng_gem = row["engines"]["pe_gemm"]
            print(f"   tiles={row['tile_count_expected']:>6}  "
                  f"pe_window={row['pe_window_ns']:8.1f}ns  "
                  f"dma_occ={eng_dma['occupancy_ns']:9.1f}  "
                  f"gemm_occ={eng_gem['occupancy_ns']:8.1f}  "
                  f"(sim {row['sim_wall_clock_s']:.1f}s)")
    out_json = Path(out_json)
    out_json.parent.mkdir(parents=True, exist_ok=True)
    out_json.write_text(json.dumps({
        "tile_sizes": {"M": TILE_M, "K": TILE_K, "N": TILE_N},
        "engines": ENGINES,
        "stages": STAGES,
        "variants": VARIANTS,
        "rows": rows,
    }, indent=2))
    print(f"\n[sweep] wrote {out_json}")
    return out_json
 # ── figure rendering ───────────────────────────────────────────────────
 # Shapes excluded from the figures (mirrors build_overview_slides).
 EXCLUDED_SHAPES = {(512, 512, 512)}
 # Stage bars shown (raw op_log stage_type keys) + display names + colors.
 STAGE_KEYS = ["DMA_READ", "FETCH", "GEMM", "DMA_WRITE"]
 STAGE_DISPLAY = {
    "DMA_READ":  "DMA in",
    "FETCH":     "Fetch",
    "GEMM":      "GEMM",
    "DMA_WRITE": "DMA out",
 }
 STAGE_COLORS = {
    "DMA_READ":  "#3B82F6",
    "FETCH":     "#10B981",
    "GEMM":      "#F59E0B",
    "DMA_WRITE": "#A855F7",
 }
 # MAC-utilization model constants (mirror build_overview_slides).
 _HBM_GBS = 256.0
 _BPE = 2
 _T_STAGE = 16.0
 _D_STAGES = 3
 _PLOT_VARIANT = "load_ref"
 def _load_sweep_data(sweep_json: Path | str = DEFAULT_SWEEP_JSON) -> dict:
    sweep_json = Path(sweep_json)
    if not sweep_json.exists():
        return {"rows": []}
    data = json.loads(sweep_json.read_text())
    data["rows"] = [
        r for r in data.get("rows", [])
        if (r["M"], r["K"], r["N"]) not in EXCLUDED_SHAPES
    ]
    return data
 def _shape_label(r: dict) -> str:
    if r["M"] == r["K"] == r["N"]:
        return f"M=K=N={r['M']}"
    return f"M={r['M']} K={r['K']} N={r['N']}"
 def _under_tile(M, K, N, tile_M, tile_K, tile_N) -> bool:
    return M < tile_M or K < tile_K or N < tile_N
 def _xtick_labels(shape_labels, tile_counts, flagged) -> list[str]:
    out = []
    for lbl, tc, fl in zip(shape_labels, tile_counts, flagged):
        s = f"{lbl}\n({tc} tiles)"
        if fl:
            s += " *"
        out.append(s)
    return out
 def _grouped_bar_png(
    out_name: str, *, out_dir: Path, title: str, subtitle: str | None,
    shape_labels, tile_counts, flagged, series: dict, colors: dict,
    y_label: str, threshold: float | None = None, footnote: str | None = None,
 ) -> str:
    """Render one grouped-bar chart to out_dir/out_name; return the path."""
    import matplotlib.pyplot as plt
    import numpy as np
    n_groups = len(shape_labels)
    n_series = max(1, len(series))
    x = np.arange(n_groups)
    width = 0.8 / n_series
    fig, ax = plt.subplots(figsize=(11, 6))
    for i, (name, vals) in enumerate(series.items()):
        offset = (i - (n_series - 1) / 2) * width
        ax.bar(x + offset, vals, width, label=name, color=colors.get(name))
    ax.set_xticks(x)
    ax.set_xticklabels(
        _xtick_labels(shape_labels, tile_counts, flagged), fontsize=8,
    )
    ax.set_ylabel(y_label)
    ax.set_title(title, fontsize=13, fontweight="bold")
    if subtitle:
        ax.text(0.5, 1.01, subtitle, transform=ax.transAxes, ha="center",
                va="bottom", fontsize=8, color="#475569")
    if threshold is not None:
        ax.axhline(threshold, ls="--", color="gray", lw=1.0)
    ax.legend(fontsize=8, loc="upper right")
    ax.grid(True, axis="y", alpha=0.3)
    caption = "* = under-tile shape (M<TILE_M, K<TILE_K, or N<TILE_N)"
    if footnote:
        caption = footnote + "\n" + caption
    fig.text(0.5, 0.01, caption, ha="center", fontsize=7, color="gray",
             wrap=True)
    fig.tight_layout(rect=(0, 0.05, 1, 1))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out = out_dir / out_name
    fig.savefig(out, dpi=120)
    plt.close(fig)
    return str(out)
 def emit_stage_breakdown(
    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
    out_dir: Path | str = DEFAULT_PLOTS_DIR,
 ) -> str | None:
    """Per-stage engine wall-clock per shape (load_ref operand staging)."""
    data = _load_sweep_data(sweep_json)
    rows = [r for r in data["rows"] if r.get("variant") == _PLOT_VARIANT]
    if not rows:
        return None
    tile = data["tile_sizes"]
    shape_labels = [_shape_label(r) for r in rows]
    flagged = [_under_tile(r["M"], r["K"], r["N"], tile["M"], tile["K"], tile["N"])
               for r in rows]
    tile_counts = [r["tile_count_expected"] for r in rows]
    series = {
        STAGE_DISPLAY[s]: [r.get("stages", {}).get(s, {}).get("wall_ns", 0.0)
                           for r in rows]
        for s in STAGE_KEYS
    }
    colors = {STAGE_DISPLAY[s]: STAGE_COLORS[s] for s in STAGE_KEYS}
    return _grouped_bar_png(
        "gemm_stage_breakdown.png", out_dir=Path(out_dir),
        title="GEMM stage breakdown",
        subtitle=(f"Per-stage engine wall-clock (DMA in / Fetch / GEMM / "
                  f"DMA out), {_PLOT_VARIANT} staging. "
                  f"Tile {tile['M']}x{tile['K']}x{tile['N']}."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="ns",
        footnote="Bars = engine wall-clock interval (merged overlaps).",
    )
 def emit_mac_utilization_measured(
    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
    out_dir: Path | str = DEFAULT_PLOTS_DIR,
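 ) -> str | None:
    """GEMM util % and useful pipeline-eff % (analytical model, load_ref).
    Despite the ``measured`` name, both series come from the ideal-pipeline
    model; no op_log timing is read here.
    """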
    data = _load_sweep_data(sweep_json)
    rows = data["rows"]
    if not rows:
        return None
    tile = data["tile_sizes"]
    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
    head_ns = (_D_STAGES - 1) * _T_STAGE
    by_shape = {(r["M"], r["K"], r["N"]): r
                for r in rows if r["variant"] == _PLOT_VARIANT}
    shapes = list(by_shape)
    if not shapes:
        return None
    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
    gemm_util, useful_eff = [], []
    for k in shapes:
        r = by_shape[k]
        M, K, N = r["M"], r["K"], r["N"]
        useful = 2 * M * K * N
        tiles = r["tile_count_expected"]
        gu = useful / (tile_flops * tiles) * 100
        gemm_util.append(gu)
        m_tiles = (M + TILE_M - 1) // TILE_M
        n_tiles = (N + TILE_N - 1) // TILE_N
        n_mn = m_tiles * n_tiles
        compute_total = tiles * _T_STAGE
        wall = head_ns + tiles * _T_STAGE + max(0, n_mn - 1) * dma_w_per_pair
        ueff = (compute_total * (gu / 100.0) / wall) * 100 if wall > 0 else 0.0
        useful_eff.append(ueff)
    series = {"GEMM util %": gemm_util, "Useful eff %": useful_eff}
    colors = {"GEMM util %": "#10B981", "Useful eff %": "#F59E0B"}
    return _grouped_bar_png(
        "gemm_mac_utilization_measured.png", out_dir=Path(out_dir),
        title="GEMM MAC utilization — load_ref",
        subtitle=("GEMM util = useful FLOPs / (tile FLOPs x tiles); "
                  "Useful eff = GEMM util x ideal pipeline efficiency."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="%", threshold=100.0,
        footnote="Theoretical ideal-pipeline model (not simulator data).",
    )
 def emit_mac_utilization_theoretical_vs_measured(
    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
    out_dir: Path | str = DEFAULT_PLOTS_DIR,
 ) -> str | None:
    """Theoretical vs simulator-measured GEMM util / useful eff (load_ref)."""
    data = _load_sweep_data(sweep_json)
    rows = data["rows"]
    if not rows:
        return None
    tile = data["tile_sizes"]
    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
    head_ns = (_D_STAGES - 1) * _T_STAGE
    peak_per_ns = tile_flops / _T_STAGE
    by_shape = {(r["M"], r["K"], r["N"]): r
                for r in rows if r["variant"] == _PLOT_VARIANT}
    shapes = list(by_shape)
    if not shapes:
        return None
    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
    gu_t, gu_m, eff_t, eff_m = [], [], [], []
    for k in shapes:
        r = by_shape[k]
        M, K, N = r["M"], r["K"], r["N"]
        useful = 2 * M * K * N
        tiles = r["tile_count_expected"]
        gut = useful / (tile_flops * tiles)
        gu_t.append(gut * 100)
        rec = r.get("stages", {}).get("GEMM", {}).get("record_count", 0) or tiles
        gu_m.append((useful / (tile_flops * rec) * 100) if rec else 0.0)
        m_tiles = (M + TILE_M - 1) // TILE_M
        n_tiles = (N + TILE_N - 1) // TILE_N
        n_mn = m_tiles * n_tiles
        compute_total = tiles * _T_STAGE
        wall_t = head_ns + compute_total + max(0, n_mn - 1) * dma_w_per_pair
        eff_t.append((compute_total * gut / wall_t * 100) if wall_t > 0 else 0.0)
        cw = r.get("composite_window_ns", 0.0) or 0.0
        eff_m.append((useful / cw / peak_per_ns * 100) if cw > 0 else 0.0)
    series = {
        "GEMM util % (theoretical)": gu_t,
        "GEMM util % (measured)":    gu_m,
        "Theoretical eff %":         eff_t,
        "Measured eff %":            eff_m,
    }
    colors = {
        "GEMM util % (theoretical)": "#10B981",
        "GEMM util % (measured)":    "#6EE7B7",
        "Theoretical eff %":         "#F59E0B",
        "Measured eff %":            "#3B82F6",
    }
    return _grouped_bar_png(
        "gemm_mac_utilization_theoretical_vs_measured.png", out_dir=Path(out_dir),
        title="GEMM MAC utilization — theoretical vs measured (load_ref)",
        subtitle=("theoretical model vs simulator op_log; agreement "
                  "validates the analytical pipeline model."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="%", threshold=100.0,
    )
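The analytical model shared by the two MAC-utilization renderers above can be checked by hand for a single shape. The sketch below copies the tile sizes and model constants from this module and evaluates them for M=K=N=48, an illustrative under-filled shape. It assumes `_HBM_GBS` is in GB/s (bytes per ns), which is inferred from the name rather than stated in the source; `model` and `ceil_div` are hypothetical helper names.

```python
# Constants copied from the module above; the shapes are illustrative choices.
TILE_M, TILE_K, TILE_N = 32, 64, 32
HBM_GBS, BPE, T_STAGE, D_STAGES = 256.0, 2, 16.0, 3


def ceil_div(a, b):
    return (a + b - 1) // b


def model(M, K, N):
    """Return (tiles, GEMM util %, useful eff %) under the ideal-pipeline model."""
    m_tiles, n_tiles, k_tiles = ceil_div(M, TILE_M), ceil_div(N, TILE_N), ceil_div(K, TILE_K)
    tiles = m_tiles * n_tiles * k_tiles
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    # Fraction of issued tile FLOPs that are useful (padding wastes the rest).
    gemm_util = (2 * M * K * N) / (tile_flops * tiles)
    # Write-back time for one output tile: bytes / (bytes per ns).
    dma_w_per_pair = (TILE_M * TILE_N * BPE) / HBM_GBS
    head_ns = (D_STAGES - 1) * T_STAGE      # pipeline fill
    compute_total = tiles * T_STAGE
    wall = head_ns + compute_total + max(0, m_tiles * n_tiles - 1) * dma_w_per_pair
    useful_eff = compute_total * gemm_util / wall
    return tiles, gemm_util * 100, useful_eff * 100


tiles, util_pct, eff_pct = model(48, 48, 48)
# tiles = 2 * 2 * 1 = 4; util = 221184 / (131072 * 4) = 42.1875 %
# wall = 32 + 64 + 3 * 8 = 120 ns; eff = 64 * 0.421875 / 120, about 22.5 %
print(tiles, util_pct, round(eff_pct, 4))
```

A shape that fills its tiles exactly, such as 64x64x64, reaches 100 % GEMM util in this model, while its useful efficiency stays below 100 % because of the pipeline-fill and write-back terms.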
 def emit_all_gemm_plots(
    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
    out_dir: Path | str = DEFAULT_PLOTS_DIR,
 ) -> list[str]:
    """Render every GEMM figure that has data; return the paths written."""
    paths = []
    for fn in (emit_stage_breakdown,
               emit_mac_utilization_measured,
               emit_mac_utilization_theoretical_vs_measured):
        p = fn(sweep_json, out_dir)
        if p:
            paths.append(p)
    return paths
 # ── bench entry ────────────────────────────────────────────────────────
@bench(
    name="milestone-1h-gemm",
    description="1H milestone: regenerate all GEMM results + figures.",
 )
 def run(torch) -> None:
    """Run the GEMM sweep (or reuse committed JSON) and render every figure.
    ``MILESTONE_FAST=1`` reuses the committed ``DEFAULT_SWEEP_JSON`` (seconds);
    otherwise the full sweep runs into ``out_dir/gemm_sweep.json`` (minutes).
    The sweep drives its own engines, so a sentinel tensor is submitted at the
    end to satisfy the run_bench contract (ADR-0045 D4).
    """
    _OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
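    # Treat unset, empty, "0", and "false" as off; bool() of any non-empty
    # string (including "0") would be True.
    fast = os.environ.get("MILESTONE_FAST", "").strip().lower() not in (
        "", "0", "false",
    )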
    if fast:
        sweep_json = DEFAULT_SWEEP_JSON
    else:
        sweep_json = run_sweep(out_json=_OUTPUT_DIR / "gemm_sweep.json")
    paths = emit_all_gemm_plots(sweep_json=sweep_json, out_dir=_OUTPUT_DIR)
    print(f"  milestone-1h-gemm: {len(paths)} figures -> {_OUTPUT_DIR} "
          f"(fast={fast})")
    torch.zeros(
        (1, 1), dtype="f16",
        dp=DPPolicy(cube="row_wise", pe="replicate", num_cubes=1, num_pes=1),
        name="milestone_gemm_sentinel",
    )
@@ -1,283 +1,31 @@
-"""Shared plotting plumbing for the GEMM figure tests.
+"""Thin re-export shim for the GEMM figure tests.
-Not a test module (no ``test_`` prefix -> pytest does not collect it).
+Not a test module (no ``test_`` prefix → pytest does not collect it).
-Reads the committed ``docs/diagrams/gemm_sweep.json`` (produced by the heavy
-``scripts/gemm_sweep.py`` sim sweep) and renders matplotlib PNGs into
-``docs/diagrams/gemm_plots/``. No simulation here -> the figure tests are fast
-and run by default; regenerating the underlying data stays a manual script.
-
+The sweep + renderer logic now lives in
+``kernbench.benches.milestone_1h_gemm`` (production single home, ADR-0054,
+also driven by ``scripts/gemm_sweep.py``). The figure tests import the same
+names from here; behavior is unchanged (defaults still target
+``docs/diagrams/gemm_plots/``).
 Chart set (mirrors the GEMM MAC slides in scripts/build_overview_slides.py):
  - stage breakdown (load_ref operand staging)
  - MAC utilization — measured (load_ref)
  - MAC utilization — theoretical vs measured (load_ref)
 """
 from __future__ import annotations
-import json
-from pathlib import Path
-ROOT = Path(__file__).resolve().parent.parent.parent
-GEMM_SWEEP_JSON = ROOT / "docs" / "diagrams" / "gemm_sweep.json"
-GEMM_PLOTS_DIR = ROOT / "docs" / "diagrams" / "gemm_plots"
-
-# Shapes excluded from the figures (mirrors build_overview_slides).
-EXCLUDED_SHAPES = {(512, 512, 512)}
-
-# Stage bars shown (raw op_log stage_type keys) + display names + colors.
-STAGE_KEYS = ["DMA_READ", "FETCH", "GEMM", "DMA_WRITE"]
+from kernbench.benches.milestone_1h_gemm import (
+    DEFAULT_PLOTS_DIR as GEMM_PLOTS_DIR,
+    DEFAULT_SWEEP_JSON as GEMM_SWEEP_JSON,
+    ROOT,
+    emit_all_gemm_plots,
+    emit_mac_utilization_measured,
+    emit_mac_utilization_theoretical_vs_measured,
+    emit_stage_breakdown,
+)
+__all__ = [
+    "GEMM_PLOTS_DIR",
+    "GEMM_SWEEP_JSON",
+    "ROOT",
+    "emit_all_gemm_plots",
+    "emit_mac_utilization_measured",
+    "emit_mac_utilization_theoretical_vs_measured",
+    "emit_stage_breakdown",
+]
 STAGE_DISPLAY = {
    "DMA_READ":  "DMA in",
    "FETCH":     "Fetch",
    "GEMM":      "GEMM",
    "DMA_WRITE": "DMA out",
 }
 STAGE_COLORS = {
    "DMA_READ":  "#3B82F6",
    "FETCH":     "#10B981",
    "GEMM":      "#F59E0B",
    "DMA_WRITE": "#A855F7",
 }
 # MAC-utilization model constants (mirror build_overview_slides).
 _HBM_GBS = 256.0
 _BPE = 2
 _T_STAGE = 16.0
 _D_STAGES = 3
 _PLOT_VARIANT = "load_ref"
 def _load_sweep_data() -> dict:
    if not GEMM_SWEEP_JSON.exists():
        return {"rows": []}
    data = json.loads(GEMM_SWEEP_JSON.read_text())
    data["rows"] = [
        r for r in data.get("rows", [])
        if (r["M"], r["K"], r["N"]) not in EXCLUDED_SHAPES
    ]
    return data
 def _shape_label(r: dict) -> str:
    if r["M"] == r["K"] == r["N"]:
        return f"M=K=N={r['M']}"
    return f"M={r['M']} K={r['K']} N={r['N']}"
 def _under_tile(M, K, N, tile_M, tile_K, tile_N) -> bool:
    return M < tile_M or K < tile_K or N < tile_N
 def _xtick_labels(shape_labels, tile_counts, flagged) -> list[str]:
    out = []
    for lbl, tc, fl in zip(shape_labels, tile_counts, flagged):
        s = f"{lbl}\n({tc} tiles)"
        if fl:
            s += " *"
        out.append(s)
    return out
 def _grouped_bar_png(
    out_name: str, *, title: str, subtitle: str | None,
    shape_labels, tile_counts, flagged, series: dict, colors: dict,
    y_label: str, threshold: float | None = None, footnote: str | None = None,
 ) -> str:
    """Render one grouped-bar chart to GEMM_PLOTS_DIR/out_name; return the path."""
    import matplotlib.pyplot as plt
    import numpy as np
    n_groups = len(shape_labels)
    n_series = max(1, len(series))
    x = np.arange(n_groups)
    width = 0.8 / n_series
    fig, ax = plt.subplots(figsize=(11, 6))
    for i, (name, vals) in enumerate(series.items()):
        offset = (i - (n_series - 1) / 2) * width
        ax.bar(x + offset, vals, width, label=name, color=colors.get(name))
    ax.set_xticks(x)
    ax.set_xticklabels(
        _xtick_labels(shape_labels, tile_counts, flagged), fontsize=8,
    )
    ax.set_ylabel(y_label)
    ax.set_title(title, fontsize=13, fontweight="bold")
    if subtitle:
        ax.text(0.5, 1.01, subtitle, transform=ax.transAxes, ha="center",
                va="bottom", fontsize=8, color="#475569")
    if threshold is not None:
        ax.axhline(threshold, ls="--", color="gray", lw=1.0)
    ax.legend(fontsize=8, loc="upper right")
    ax.grid(True, axis="y", alpha=0.3)
    caption = "* = under-tile shape (M<TILE_M, K<TILE_K, or N<TILE_N)"
    if footnote:
        caption = footnote + "\n" + caption
    fig.text(0.5, 0.01, caption, ha="center", fontsize=7, color="gray",
             wrap=True)
    fig.tight_layout(rect=(0, 0.05, 1, 1))
    GEMM_PLOTS_DIR.mkdir(parents=True, exist_ok=True)
    out = GEMM_PLOTS_DIR / out_name
    fig.savefig(out, dpi=120)
    plt.close(fig)
    return str(out)
 # ── individual chart renderers (read sweep JSON, emit one PNG each) ─────
 def emit_stage_breakdown() -> str | None:
    """Per-stage engine wall-clock per shape (load_ref operand staging)."""
    data = _load_sweep_data()
    rows = [r for r in data["rows"] if r.get("variant") == _PLOT_VARIANT]
    if not rows:
        return None
    tile = data["tile_sizes"]
    shape_labels = [_shape_label(r) for r in rows]
    flagged = [_under_tile(r["M"], r["K"], r["N"], tile["M"], tile["K"], tile["N"])
               for r in rows]
    tile_counts = [r["tile_count_expected"] for r in rows]
    series = {
        STAGE_DISPLAY[s]: [r.get("stages", {}).get(s, {}).get("wall_ns", 0.0)
                           for r in rows]
        for s in STAGE_KEYS
    }
    colors = {STAGE_DISPLAY[s]: STAGE_COLORS[s] for s in STAGE_KEYS}
    return _grouped_bar_png(
        "gemm_stage_breakdown.png",
        title="GEMM stage breakdown",
        subtitle=(f"Per-stage engine wall-clock (DMA in / Fetch / GEMM / "
                  f"DMA out), {_PLOT_VARIANT} staging. "
                  f"Tile {tile['M']}x{tile['K']}x{tile['N']}."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="ns",
        footnote="Bars = engine wall-clock interval (merged overlaps).",
    )
 def emit_mac_utilization_measured() -> str | None:
    """GEMM util % and useful pipeline-eff % (analytical model, load_ref)."""
    data = _load_sweep_data()
    rows = data["rows"]
    if not rows:
        return None
    tile = data["tile_sizes"]
    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
    head_ns = (_D_STAGES - 1) * _T_STAGE
    by_shape = {(r["M"], r["K"], r["N"]): r
                for r in rows if r["variant"] == _PLOT_VARIANT}
    shapes = list(by_shape)
    if not shapes:
        return None
    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
    gemm_util, useful_eff = [], []
    for k in shapes:
        r = by_shape[k]
        M, K, N = r["M"], r["K"], r["N"]
        useful = 2 * M * K * N
        tiles = r["tile_count_expected"]
        gu = useful / (tile_flops * tiles) * 100
        gemm_util.append(gu)
        m_tiles = (M + TILE_M - 1) // TILE_M
        n_tiles = (N + TILE_N - 1) // TILE_N
        n_mn = m_tiles * n_tiles
        compute_total = tiles * _T_STAGE
        wall = head_ns + tiles * _T_STAGE + max(0, n_mn - 1) * dma_w_per_pair
        ueff = (compute_total * (gu / 100.0) / wall) * 100 if wall > 0 else 0.0
        useful_eff.append(ueff)
    series = {"GEMM util %": gemm_util, "Useful eff %": useful_eff}
    colors = {"GEMM util %": "#10B981", "Useful eff %": "#F59E0B"}
    return _grouped_bar_png(
        "gemm_mac_utilization_measured.png",
        title="GEMM MAC utilization — load_ref",
        subtitle=("GEMM util = useful FLOPs / (tile FLOPs x tiles); "
                  "Useful eff = GEMM util x ideal pipeline efficiency."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="%", threshold=100.0,
        footnote="Theoretical ideal-pipeline model (not simulator data).",
    )
 def emit_mac_utilization_theoretical_vs_measured() -> str | None:
    """Theoretical vs simulator-measured GEMM util / useful eff (load_ref)."""
    data = _load_sweep_data()
    rows = data["rows"]
    if not rows:
        return None
    tile = data["tile_sizes"]
    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
    head_ns = (_D_STAGES - 1) * _T_STAGE
    peak_per_ns = tile_flops / _T_STAGE
    by_shape = {(r["M"], r["K"], r["N"]): r
                for r in rows if r["variant"] == _PLOT_VARIANT}
    shapes = list(by_shape)
    if not shapes:
        return None
    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
    gu_t, gu_m, eff_t, eff_m = [], [], [], []
    for k in shapes:
        r = by_shape[k]
        M, K, N = r["M"], r["K"], r["N"]
        useful = 2 * M * K * N
        tiles = r["tile_count_expected"]
        gut = useful / (tile_flops * tiles)
        gu_t.append(gut * 100)
        rec = r.get("stages", {}).get("GEMM", {}).get("record_count", 0) or tiles
        gu_m.append((useful / (tile_flops * rec) * 100) if rec else 0.0)
        m_tiles = (M + TILE_M - 1) // TILE_M
        n_tiles = (N + TILE_N - 1) // TILE_N
        n_mn = m_tiles * n_tiles
        compute_total = tiles * _T_STAGE
        wall_t = head_ns + compute_total + max(0, n_mn - 1) * dma_w_per_pair
        eff_t.append((compute_total * gut / wall_t * 100) if wall_t > 0 else 0.0)
        cw = r.get("composite_window_ns", 0.0) or 0.0
        eff_m.append((useful / cw / peak_per_ns * 100) if cw > 0 else 0.0)
    series = {
        "GEMM util % (theoretical)": gu_t,
        "GEMM util % (measured)":    gu_m,
        "Theoretical eff %":         eff_t,
        "Measured eff %":            eff_m,
    }
    colors = {
        "GEMM util % (theoretical)": "#10B981",
        "GEMM util % (measured)":    "#6EE7B7",
        "Theoretical eff %":         "#F59E0B",
        "Measured eff %":            "#3B82F6",
    }
    return _grouped_bar_png(
        "gemm_mac_utilization_theoretical_vs_measured.png",
        title="GEMM MAC utilization — theoretical vs measured (load_ref)",
        subtitle=("theoretical model vs simulator op_log; agreement "
                  "validates the analytical pipeline model."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="%", threshold=100.0,
    )
 def emit_all_gemm_plots() -> list[str]:
    """Render every GEMM figure that has data; return the list of paths written."""
    paths = []
    for fn in (emit_stage_breakdown,
               emit_mac_utilization_measured,
               emit_mac_utilization_theoretical_vs_measured):
        p = fn()
        if p:
            paths.append(p)
    return paths
@@ -0,0 +1,77 @@
 """Milestone benches: registration + figure/result generation (ADR-0054).
 ``milestone-1h-gemm`` / ``milestone-1h-ccl`` are eval benches: when run via
 the normal ``run_bench`` path, they regenerate every GEMM / allreduce figure
 and CSV into ``benches/1H_milestone_output/{gemm,ccl}/``. With
 ``MILESTONE_FAST=1`` the GEMM bench only re-renders the committed sweep JSON
 (fast, so it runs by default here); the CCL bench drives both full sweeps
 (slow, so it is marked ``slow`` and opt-in).
 """
 from __future__ import annotations
 import re
 from pathlib import Path
 import pytest
 from kernbench.benches.registry import resolve
 from kernbench.runtime_api.bench_runner import run_bench
 from kernbench.runtime_api.types import resolve_device
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 import kernbench.benches.milestone_1h_ccl as ccl_bench
 import kernbench.benches.milestone_1h_gemm as gemm_bench
 _NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")
 def _run(name: str):
    topo = resolve_topology("topology.yaml")
    return run_bench(
        topology=topo, bench_fn=resolve(name).run, device=resolve_device(None),
        engine_factory=lambda t, d: GraphEngine(
            getattr(t, "topology_obj", t), enable_data=True,
        ),
    )
 def test_milestone_benches_registered():
    for name in ("milestone-1h-gemm", "milestone-1h-ccl"):
        spec = resolve(name)
        assert spec.name == name
        assert _NAME_RE.match(spec.name)
        assert spec.description.strip()
@pytest.mark.skipif(
    not gemm_bench.DEFAULT_SWEEP_JSON.exists(),
    reason="gemm_sweep.json absent; run scripts/gemm_sweep.py first",
 )
 def test_milestone_gemm_fast_generates_figures(monkeypatch):
    monkeypatch.setenv("MILESTONE_FAST", "1")
    result = _run("milestone-1h-gemm")
    assert result.completion.ok, result.completion
    out = gemm_bench._OUTPUT_DIR
    for png in (
        "gemm_stage_breakdown.png",
        "gemm_mac_utilization_measured.png",
        "gemm_mac_utilization_theoretical_vs_measured.png",
    ):
        assert (out / png).exists(), f"missing {png}"
@pytest.mark.slow
 def test_milestone_ccl_generates_figures():
    result = _run("milestone-1h-ccl")
    assert result.completion.ok, result.completion
    out = ccl_bench._OUTPUT_DIR
    for artifact in (
        "summary.csv",
        "topology.png",
        "comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png",
        "AllReduce_LRAB_2Dtorus_6SiP_2x3_with_TCM_SRAM_HBM.png",
        "AllReduce_LRAB_Ring1D_6SiP_1x6.png",
        "AllReduce_LRAB_2Dtorus_6SiP_2x3.png",
        "AllReduce_LRAB_2DMesh_6SiP_2x3.png",
    ):
        assert (out / artifact).exists(), f"missing {artifact}"
@@ -93,6 +93,7 @@ CLASSIFICATION: dict[int, tuple[str, str | None]] = {
    51: (IMPL_DECISIONS, "Routing & Helper API"),
    52: (IMPL_DECISIONS, "Sim-engine Op Log and Memory Store Schemas"),
    53: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
    54: (IMPL_DECISIONS, "Evaluation Harnesses"),
 }
 # Canonical component order for the Detailed Architecture section.