eval: fold GEMM/allreduce harnesses into self-contained milestone benches

Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/
into two self-contained eval benches so a user can regenerate every
result + figure with one command:

    kernbench run --bench milestone-1h-gemm   (MILESTONE_FAST=1 reuses JSON)
    kernbench run --bench milestone-1h-ccl

- benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the
  run(torch) entry drives the sweeps and writes figures into
  benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a
  sentinel tensor to satisfy the run_bench contract.
- tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin
  re-export/wrapper shims over the benches (single source preserved); the
  pytest-only param builders + _run_distributed wrapper stay in the shim.
- eval-bench pattern: a bench may drive many configs + build its own
  per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2).

ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI
Semantics amended; ADR INDEX regenerated.

Verified: milestone benches run clean (ok=True, all artifacts), full suite
67 passed, lang-pairs OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 15:19:32 -07:00
parent e33e76f2d1
commit cc1bbd0ab7
19 changed files with 2189 additions and 1465 deletions
@@ -7,6 +7,11 @@ Accepted
 `tests/sccl/` 평가 하니스를 문서화한다; 구현과 대조 검증 완료
 (상수, 파일 집합, 스윕 차원을 교차 확인).

+**ADR-0054로 개정됨**: 드라이버 코어, sweep, renderer가 `milestone-1h-ccl`
+bench(단일 home)로 이동했다; `tests/sccl/_allreduce_helpers.py`는 이제 거기서
+re-export한다(pytest 전용 param 빌더 + `_run_distributed` wrapper는 로컬
+유지). figure 테스트는 변경 없음.
+
 ## Context

 ADR-0032는 intercube all-reduce *알고리즘*을 정의하고, ADR-0023/0024/0027은
@@ -8,6 +8,12 @@ GEMM 평가/특성화 하니스를 문서화한다; 구현과 대조 검증 완
 (상수, tile 크기, figure 집합, script↔test 분할을 교차 확인). D5/D6
 caveat은 부정확이 아니라 기록된 한계다.

+**ADR-0054로 개정됨**: sweep + renderer가 `milestone-1h-gemm` bench(단일
+home)로 이동했다; `scripts/gemm_sweep.py`와 `tests/gemm/`는 이제 거기서
+re-export한다. D1/D2의 "데이터 생성은 수동 script / 무거운 작업은 opt-in"은
+평가-bench 패턴으로 대체된다(하나의 bench가 전부 재생성;
+`MILESTONE_FAST=1`은 committed JSON 재사용).
+
 ## Context

 ADR-0014(PE pipeline)와 ADR-0042(tile-plan generator)는 GEMM *구현*을
@@ -10,6 +10,10 @@ Accepted (2026-05-21).
 **bench가 어떻게 등록되고 어떤 함수 시그너처를 따라야 하는가**는 ADR 레벨에
 없었음.

+**ADR-0054로 확장됨**: D5의 단일 구성 규칙에 세 번째 패턴이 추가된다 —
+*평가 bench*(예: `milestone-1h-*`)는 여러 구성을 구동하고, 구성별 자체 엔진을
+빌드하며, D4를 만족시키기 위해 sentinel 텐서를 제출한다.
+
 ## First action (제일 처음에 하는 일)

 `kernbench.benches` 패키지가 임포트되면 `__init__.py` 가 즉시
@@ -0,0 +1,137 @@
+# ADR-0054: 마일스톤 평가 bench — 자기완결적 sweep + figure bench
+
+## Status
+
+Accepted (2026-05-22).
+
+ADR-0044(D1/D2)와 ADR-0045(D5)를 개정하고, ADR-0043/0044의 "로직이
+`scripts/` + `tests/`에 산다" 배치를 대체한다: GEMM/allreduce 평가
+하니스가 이제 사용자가 실행하여 모든 결과 + figure를 재생성하는
+자기완결적 **bench**가 된다.
+
+## Context
+
+ADR-0043(allreduce 평가)과 ADR-0044(GEMM 평가)는 각 하니스를 **sweep**
+(수동 `scripts/` 드라이버, 또는 allreduce의 경우 parametrized 테스트
+자체) + committed 데이터를 렌더링하는 **figure 테스트**로 분리했다.
+따라서 sweep/render 로직은 `scripts/gemm_sweep.py`,
+`tests/gemm/_gemm_plot_helpers.py`, `tests/sccl/_allreduce_helpers.py`에
+존재했다.
+
+마일스톤 요구사항("사용자가 *하나의 bench*를 실행해 모든 결과와 플롯을
+생성하도록 allreduce + GEMM 평가를 리팩터")은 그 배치로는 충족 불가다:
+bench는 production 코드이며 **`tests/`를 import할 수 없다**(ADR-0007 레이어
+방향). 평가 로직은 bench에서 닿을 수 있도록 production으로 이동해야 했다.
+
+선택한 home은 별도 `kernbench.eval` 패키지가 아니라 bench 모듈 자체다.
+bench 파일은 임의의 모듈 레벨 코드를 가질 수 있으며, 하니스를 bench로
+합치면 도메인당 파일 하나가 유지되고 패키지 레이어가 하나 줄어든다.
+
+## Decision
+
+### D1. 두 마일스톤 bench가 평가 로직을 보유
+
+- `src/kernbench/benches/milestone_1h_gemm.py` — GEMM shape×variant sweep
+  + 세 figure renderer(`scripts/gemm_sweep.py` +
+  `tests/gemm/_gemm_plot_helpers.py`에서 이동).
+- `src/kernbench/benches/milestone_1h_ccl.py` — distributed allreduce
+  드라이버, latency + buffer-kind sweep, topology diagram, FSIM 비교, 그리고
+  direct-launch 패리티 레퍼런스(`tests/sccl/_allreduce_helpers.py`에서 이동).
+
+각 파일은 해당 도메인 평가 로직의 **단일 home**이다.
+
+### D2. "평가 bench" 패턴 (ADR-0045 D5 확장)
+
+ADR-0045 D5는 bench를 단일 구성(single-SIP, 또는 ADR-0024 multi-SIP CCL
+예외)으로 고정했다. 본 ADR은 세 번째 패턴을 추가한다:
+
+- **평가 bench**는 *여러* 구성을 구동하고 figure를 렌더링할 수 있다. 외부
+  `run_bench` 엔진 대신 sweep 지점마다 자체 `GraphEngine` /
+  `RuntimeContext`를 빌드한다.
+- 그러면 외부 ctx에 제출된 handle이 없으므로, bench는 마지막에
+  **sentinel 텐서**(`torch.zeros((1, 1), …)`)를 제출하여 `run_bench`의
+  "최소 한 번 제출" 계약(ADR-0045 D4)을 만족시키고 CLI가 0으로 종료되게
+  한다.
+
+### D3. 출력 위치
+
+두 bench 모두 `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`에
+쓴다(사용자 요청 — bench 옆 아티팩트). 디렉터리는 생성된 PNG/CSV/JSON만
+보유하며(`.py`/`__init__.py` 없음), 따라서 eager-import audit(ADR-0045
+첫 동작)이 무시한다 — `pkgutil.iter_modules`는 비-패키지 하위 디렉터리를
+yield하지 않는다. committed `docs/diagrams/` 아티팩트와 달리
+**git-ignore**된다(요청 시 재생성 가능).
+
+### D4. GEMM 무거운 sweep — 기본은 fresh, `MILESTONE_FAST`로 재사용
+
+`milestone-1h-gemm`은 기본적으로 전체 24-sim sweep을 실행한다(분 단위;
+한 shape는 2048 tile). `MILESTONE_FAST=1`은 committed
+`docs/diagrams/gemm_sweep.json`을 재사용하고 렌더링만 한다(초 단위). 이는
+ADR-0044 D1/D2의 "무거운 sweep은 수동/`slow` 단계로 유지"를 뒤집는다:
+bench 실행이 곧 재생성이다. slow 경로는 `@pytest.mark.slow` bench
+테스트로 행사되고, fast 경로는 기본 실행된다.
+
+### D5. 테스트 + 스크립트는 thin re-export shim으로 재사용 (단일 home 유지)
+
+기존 figure 테스트와 `scripts/gemm_sweep.py` 진입점은 유지되며 이제 bench
+모듈을 재사용한다:
+
+- `tests/gemm/_gemm_plot_helpers.py` → renderer +
+  `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT`를
+  `kernbench.benches.milestone_1h_gemm`에서 re-export.
+- `tests/sccl/_allreduce_helpers.py` → 드라이버 코어, config writer, sweep
+  상수, renderer, disk aggregator를 `kernbench.benches.milestone_1h_ccl`에서
+  re-export하고, **pytest 전용** 조각은 로컬 유지: `pytest.param` 행렬
+  (`CONFIGS` / `_sweep_params` / `_bk_params`)과 fixture 결합
+  `_run_distributed`(`monkeypatch.chdir` + `_drive_distributed`) wrapper.
+- `scripts/gemm_sweep.py` → bench의 `run_sweep` 위 thin wrapper.
+
+테스트가 bench 모듈을 import하는 것은 허용된다(테스트는 production 위에
+위치, ADR-0007); 이는 전체 패키지 eager audit을 유발하며, 그것은 이미 매
+`kernbench` 실행 시 동작한다. matplotlib는 renderer 내부에서 lazy import로
+유지되어 audit의 startup 비용은 불변이다.
+
+### D6. 평면 모듈 네이밍 (`benches/` 하위 폴더 없음)
+
+`1H_milestone…`로 명명된 `benches/` 하위 패키지는 불가능하다 — Python
+패키지 이름은 숫자로 시작할 수 없다. 따라서 bench는 평면 모듈
+`milestone_1h_gemm.py` / `milestone_1h_ccl.py`이며 bench 이름은
+`milestone-1h-gemm` / `milestone-1h-ccl`(kebab-case, ADR-0045 D1에 따라
+글자로 시작)이다.
+
+## Consequences
+
+### Positive
+
+- `kernbench run --bench milestone-1h-gemm`(또는 `…-ccl`)이 도메인의 모든
+  결과 + figure를 한 명령으로 재생성한다 — 마일스톤 요구사항.
+- 평가 로직의 단일 소스(bench), shim을 통해 테스트와 스크립트가 재사용;
+  중복 없음.
+- figure 테스트와 `scripts/gemm_sweep.py`는 변경 없이 계속 동작.
+
+### Negative / limitations
+
+- 두 bench 파일이 크다(CCL 쪽은 distributed 드라이버, sweep, matplotlib
+  드로잉을 섞는다). 대부분 평가 하니스인 "bench"는 이례적이며, 본 ADR이
+  이를 정당화한다.
+- 생성 아티팩트가 명시적 요청에 의해 source tree(`src/kernbench/benches/`)
+  안에 산다; 커밋을 피하려 git-ignore.
+- `milestone-1h-ccl`(및 기본 `milestone-1h-gemm`)은 분 단위 소요 —
+  on-demand 마일스톤 아티팩트에는 수용 가능, 일상 실행에는 아님.
+
+## Dependencies
+
+- **ADR-0007**: 레이어 방향(테스트는 production을 import할 수 있으나 bench는
+  테스트를 import할 수 없는 이유).
+- **ADR-0043 / ADR-0044**: 본 ADR이 bench로 이전하는 allreduce / GEMM 평가
+  하니스.
+- **ADR-0045**: bench 모듈 계약; 여기 D2가 그 D5(single-device 규칙)를
+  평가-bench 패턴으로 확장하고, sentinel을 위해 D4(NO_REQUESTS)에 의존.
+- **ADR-0024**: allreduce sweep이 구동하는 rank = SIP launcher.
+
+## Open questions
+
+- GEMM theoretical 모델 상수(ADR-0044 D5)를 복사 대신 ADR-0033/0014에서
+  소싱해야 하는가? 본 ADR로는 불변.
+- `build_overview_slides.py`가 GEMM 막대를 네이티브로 그리는 대신 마일스톤
+  출력 PNG를 소비해야 하는가? 여전히 open(ADR-0044 D6 / Negative).
@@ -1,6 +1,6 @@
 # ADR Index

-Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **46**.
+Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **47**.

 Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.

@@ -152,6 +152,7 @@ One subsection per component file under `src/kernbench/components/builtin/`.

 - [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce 평가 하니스 — `tests/sccl/`
 - [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM 평가 하니스 — `scripts/gemm_sweep.py` + `tests/gemm/`
+- [ADR-0054](./ADR-0054-eval-milestone-benches.md) — 마일스톤 평가 bench — 자기완결적 sweep + figure bench

 ### Bench Module Contract

@@ -7,6 +7,11 @@ Accepted
 Documents the `tests/sccl/` evaluation harness; verified against the
 implementation (constants, file set, and sweep dimensions cross-checked).

+**Amended by ADR-0054**: the driver core, sweeps, and renderers moved into
+the `milestone-1h-ccl` bench (single home); `tests/sccl/_allreduce_helpers.py`
+now re-exports from it (keeping the pytest-only param builders +
+`_run_distributed` wrapper local). The figure tests are unchanged.
+
 ## Context

 ADR-0032 defines the intercube all-reduce *algorithm*; ADR-0023/0024/0027
@@ -9,6 +9,12 @@ implementation (constants, tile sizes, figure set, and the script↔test
 split cross-checked). The D5/D6 caveats are recorded limitations, not
 inaccuracies.

+**Amended by ADR-0054**: the sweep + renderers moved into the
+`milestone-1h-gemm` bench (single home); `scripts/gemm_sweep.py` and
+`tests/gemm/` now re-export from it. D1/D2's "data generation stays a manual
+script / heavy work is opt-in" is superseded by the eval-bench pattern (one
+bench regenerates everything; `MILESTONE_FAST=1` reuses the committed JSON).
+
 ## Context

 ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generators) define the GEMM
@@ -10,6 +10,10 @@ module must follow. ADR-0010 (CLI surface) specifies the `kernbench
 list/run` interface, but **how benches are registered and what signature
 they must follow** had no ADR-level coverage.

+**Extended by ADR-0054**: D5's single-config rule gains a third pattern —
+the *eval bench* (e.g. `milestone-1h-*`) drives many configs, builds its
+own per-config engines, and submits a sentinel tensor to satisfy D4.
+
 ## First action

 When `kernbench.benches` is imported, `__init__.py` immediately calls
@@ -0,0 +1,141 @@
+# ADR-0054: Milestone Eval Benches — self-contained sweep + figure benches
+
+## Status
+
+Accepted (2026-05-22).
+
+Amends ADR-0044 (D1/D2) and ADR-0045 (D5) and supersedes the "logic lives
+in `scripts/` + `tests/`" arrangement of ADR-0043/0044: the GEMM and
+allreduce evaluation harnesses are now self-contained **benches** that a
+user runs to regenerate every result + figure.
+
+## Context
+
+ADR-0043 (allreduce eval) and ADR-0044 (GEMM eval) split each harness into
+a **sweep** (a manual `scripts/` driver, or — for allreduce — the
+parametrized tests themselves) plus **figure tests** that render committed
+data. The sweep/render logic therefore lived under `scripts/gemm_sweep.py`,
+`tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.
+
+A milestone requirement ("refactor allreduce + GEMM evaluation so a user
+can run *one bench* to generate all the results and plots") cannot be met
+by that layout: a bench is production code and **must not import from
+`tests/`** (ADR-0007 layer direction). The eval logic had to move into
+production, reachable from a bench.
+
+The chosen home is the bench module itself — not a separate
+`kernbench.eval` package. A bench file may contain arbitrary module-level
+code; collapsing the harness into the bench keeps one file per domain and
+avoids an extra package layer.
+
+## Decision
+
+### D1. Two milestone benches own the eval logic
+
+- `src/kernbench/benches/milestone_1h_gemm.py` — GEMM shape×variant sweep +
+  the three figure renderers (moved from `scripts/gemm_sweep.py` +
+  `tests/gemm/_gemm_plot_helpers.py`).
+- `src/kernbench/benches/milestone_1h_ccl.py` — the distributed allreduce
+  driver, latency + buffer-kind sweeps, topology diagram, FSIM comparison,
+  and the direct-launch parity reference (moved from
+  `tests/sccl/_allreduce_helpers.py`).
+
+Each file is the **single home** for its domain's eval logic.
+
+### D2. The "eval bench" pattern (extends ADR-0045 D5)
+
+ADR-0045 D5 fixed a bench to a single configuration (single-SIP, or the
+ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:
+
+- An **eval bench** may drive *many* configurations and render figures. It
+  builds its own per-config `GraphEngine` / `RuntimeContext` instances
+  (one per sweep point) rather than using the outer `run_bench` engine.
+- Because the outer ctx then has no submitted handles, the bench submits a
+  **sentinel tensor** (`torch.zeros((1, 1), …)`) at the end to satisfy
+  `run_bench`'s "must submit at least one request" contract (ADR-0045 D4),
+  so the CLI exits 0.
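
The control flow described in D2 can be sketched without the real runtime. Everything below is a hypothetical stand-in: `OuterContext`, `simulate_config`, and the placeholder config names are invented for illustration and are not the `GraphEngine` / `RuntimeContext` / `run_bench` API. The sketch only shows the shape of the pattern: each sweep point uses a private engine, and one trivial request is submitted to the outer context at the end.

```python
from dataclasses import dataclass, field


@dataclass
class OuterContext:
    """Hypothetical stand-in for the outer context that run_bench inspects."""

    submitted: list = field(default_factory=list)

    def submit(self, request):
        self.submitted.append(request)


def simulate_config(config):
    """Stand-in for building a private engine and running one sweep point."""
    engine = {"config": config}  # placeholder for a fresh per-config engine
    return {"config": engine["config"], "ok": True}


def run_eval_bench(outer_ctx, configs):
    # Each sweep point runs on its own engine; outer_ctx is not touched here.
    results = [simulate_config(cfg) for cfg in configs]
    # Submit one trivial request so the outer context is not left empty.
    outer_ctx.submit("sentinel")
    return results


ctx = OuterContext()
results = run_eval_bench(ctx, ["cfg-a", "cfg-b", "cfg-c"])
print(len(results), ctx.submitted)
```

Under these assumptions the outer context ends with exactly one submission regardless of how many sweep points ran, which is the property the sentinel exists to provide.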
+
+### D3. Output location
+
+Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
+(per user request — artifacts beside the bench). The directory holds only
+generated PNG/CSV/JSON (never a `.py`/`__init__.py`), so the eager-import
+audit (ADR-0045 first action) ignores it — `pkgutil.iter_modules` does not
+yield non-package subdirectories. It is **git-ignored** (regenerable on
+demand), unlike the committed `docs/diagrams/` artifacts.
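
The `pkgutil.iter_modules` behavior that D3 relies on can be checked in isolation. This is a minimal sketch using a temporary directory; the file names mirror the ones in this ADR, but the layout is a stand-in for the real `benches/` package rather than a copy of it.

```python
import pkgutil
import tempfile
from pathlib import Path


def discovered_modules(package_dir):
    """Return the module names pkgutil would report for a directory."""
    return sorted(info.name for info in pkgutil.iter_modules([str(package_dir)]))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # A flat bench module, as in D1.
    (root / "milestone_1h_gemm.py").write_text("")
    # An output directory holding only generated artifacts (no __init__.py).
    out = root / "1H_milestone_output" / "gemm"
    out.mkdir(parents=True)
    (out / "figure.png").write_bytes(b"")
    found = discovered_modules(root)

print(found)  # only the .py module is reported
```

The artifact directory is skipped because it contains no `__init__` file, so an audit built on `iter_modules` never tries to import it.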
+
+### D4. GEMM heavy sweep — fresh by default, `MILESTONE_FAST` to reuse
+
+`milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one
+shape is 2048 tiles). `MILESTONE_FAST=1` reuses the committed
+`docs/diagrams/gemm_sweep.json` and only re-renders (seconds). This
+reverses ADR-0044 D1/D2's "heavy sweep stays a manual/`slow`-marked step":
+running the bench *is* the regeneration. In the test suite, the full sweep
+runs only under `@pytest.mark.slow`; the fast path runs in the default pass.
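
The fresh-versus-reuse switch in D4 could look roughly like the following. This is a sketch under stated assumptions: `load_or_run_sweep` is a hypothetical helper, the sweep is passed in as a plain callable rather than the bench's real `run_sweep`, the JSON payload is a placeholder, and falling back to a fresh sweep when the JSON file is missing is an assumption the ADR does not state.

```python
import json
import os
import tempfile
from pathlib import Path


def load_or_run_sweep(json_path, run_sweep, env=None):
    """Reuse cached sweep data when MILESTONE_FAST=1, otherwise sweep afresh."""
    env = os.environ if env is None else env
    path = Path(json_path)
    if env.get("MILESTONE_FAST") == "1" and path.exists():
        return json.loads(path.read_text()), "reused"
    # Assumption: a missing cache file also falls through to a fresh sweep.
    return run_sweep(), "fresh"


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "gemm_sweep.json"
    cached.write_text(json.dumps({"placeholder": True}))
    calls = []

    def fake_sweep():
        # Stand-in for the expensive multi-minute sweep.
        calls.append(1)
        return {"placeholder": True}

    fast = load_or_run_sweep(cached, fake_sweep, env={"MILESTONE_FAST": "1"})
    full = load_or_run_sweep(cached, fake_sweep, env={})

print(fast[1], full[1], len(calls))
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.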
+
+### D5. Tests + script reuse via thin re-export shims (single home kept)
+
+The pre-existing figure tests and the `scripts/gemm_sweep.py` entry point
+are retained and now reuse the bench modules:
+
+- `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
+  `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
+  `kernbench.benches.milestone_1h_gemm`.
+- `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
+  writers, sweep constants, renderers, and disk aggregators from
+  `kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
+  local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
+  `_bk_params`) and the fixture-coupled `_run_distributed`
+  (`monkeypatch.chdir` + `_drive_distributed`) wrapper.
+- `scripts/gemm_sweep.py` → thin wrapper over the bench's `run_sweep`.
+
+Tests may import a bench module (tests sit above production, ADR-0007);
+doing so triggers the whole-package eager audit, which already runs on
+every `kernbench` invocation. matplotlib stays lazily imported inside the
+renderers, so the audit's startup cost is unchanged.
+
+### D6. Flat module naming (no `benches/` subfolder)
+
+A `benches/` subpackage named `1H_milestone…` is impossible — a Python
+package name cannot start with a digit. The benches are therefore flat
+modules `milestone_1h_gemm.py` / `milestone_1h_ccl.py` with bench names
+`milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, letter-first per
+ADR-0045 D1).
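
The naming constraint in D6 comes down to identifier rules: a name that starts with a digit is not a valid Python identifier, so it cannot be written in an `import` statement. The sketch below checks that with the standard library. The `bench_name_for` mapping (underscores to hyphens) is an assumption made for illustration, inferred from the module/bench name pairs listed above, not a documented kernbench rule.

```python
import keyword


def is_importable_name(name):
    """True if `name` can be written as a module name in an import statement."""
    return name.isidentifier() and not keyword.iskeyword(name)


def bench_name_for(module_name):
    # Assumed mapping from module name to kebab-case bench name.
    return module_name.replace("_", "-")


candidates = ("1H_milestone", "milestone_1h_gemm", "milestone_1h_ccl")
checks = {name: is_importable_name(name) for name in candidates}
bench_names = [bench_name_for(name) for name in candidates if checks[name]]

print(checks)
print(bench_names)
```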
+
+## Consequences
+
+### Positive
+
+- `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of
+  a domain's results + figures in one command — the milestone requirement.
+- Single source for the eval logic (the bench), reused by tests and the
+  script via shims; no duplication.
+- The figure tests and `scripts/gemm_sweep.py` keep working unchanged.
+
+### Negative / limitations
+
+- The two bench files are large (the CCL one mixes the distributed driver,
+  sweeps, and matplotlib drawing). A "bench" that is mostly an eval harness
+  is unusual; this ADR legitimizes it.
+- Generated artifacts live inside the source tree (`src/kernbench/benches/`)
+  by explicit request; git-ignored to avoid committing them.
+- `milestone-1h-ccl` (and the default `milestone-1h-gemm`) take minutes —
+  acceptable for an on-demand milestone artifact, not for routine runs.
+
+## Dependencies
+
+- **ADR-0007**: layer direction (why tests may import production but a bench
+  may not import tests).
+- **ADR-0043 / ADR-0044**: the allreduce / GEMM eval harnesses this ADR
+  relocates into benches.
+- **ADR-0045**: bench module contract; D2 here extends its D5 (single-config
+  rule) with the eval-bench pattern, and relies on D4 (NO_REQUESTS) for the
+  sentinel.
+- **ADR-0024**: rank = SIP launcher driven by the allreduce sweeps.
+
+## Open questions
+
+- Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from
+  ADR-0033/0014 rather than copied? Unchanged by this ADR.
+- Should `build_overview_slides.py` consume the milestone output PNGs
+  instead of drawing GEMM bars natively? Still open (ADR-0044 D6 / Negative).
@@ -1,6 +1,6 @@
 # ADR Index

-Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **46**.
+Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **47**.

 Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.

@@ -152,6 +152,7 @@ One subsection per component file under `src/kernbench/components/builtin/`.

 - [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce Evaluation Harness — `tests/sccl/`
 - [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`
+- [ADR-0054](./ADR-0054-eval-milestone-benches.md) — Milestone Eval Benches — self-contained sweep + figure benches

 ### Bench Module Contract