adr: add ADR-0043/0044 (eval harnesses); reconcile ADR-0024/0032 for SIP w/h

Document the allreduce + GEMM evaluation harnesses and bring the affected
allreduce ADRs in line with the refactored code.

New (Accepted, EN + KO):
- ADR-0043: allreduce evaluation harness (tests/sccl/): distributed-driven
  correctness, latency/buffer-kind sweeps, sessionfinish plot aggregators,
  topology + FSIM-comparison figures. Verified against the implementation.
- ADR-0044: GEMM evaluation harness (scripts/gemm_sweep.py + tests/gemm/):
  heavy-script data gen vs. fast test-rendered figures, slow regenerator,
  the 3-figure set. Records two limitations as open questions: the
  theoretical-model constants are inherited (not yet traced to
  ADR-0033/0014), and the *_measured figure is a naming misnomer.

Updated (EN + KO):
- ADR-0024: add D5: SIP grid w/h resolution (explicit sips.w/h, square
  fallback, fail-loud), documenting the AhbmCCLBackend fix.
- ADR-0032: D4/D5/Non-goals reconciled: rectangular SIP grids (e.g. 6 SIPs
  as 3x2) are supported via explicit w/h; the square requirement now applies
  only to the fallback. Affected-files repointed to tests/sccl/.

Verification: ADR-0023 and ADR-0042 confirmed still matching the code (no
change). verify_adr_lang_pairs.py passes (EN/KO Status blocks byte-equal).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gemm: test-generated GEMM plots under tests/gemm/ + docs/diagrams/gemm_plots/
2026-05-21 10:26:25 -07:00 · 2026-05-21 09:58:08 -07:00 · 2026-05-20 22:24:43 -07:00 · 2026-05-20 20:50:48 -07:00 · 2026-05-20 14:54:18 -07:00 · 2026-05-20 14:43:03 -07:00
101 changed files with 4101 additions and 1098 deletions
@@ -62,6 +62,10 @@ After writing the document, report to the user in the chat response:
 - **G2 gaps** — ADRs missing **Context** or **Decision**. Alternatives
  and Consequences are optional; their absence is NOT a gap.
 - **G3 gaps** — ADR cross-references without a back-reference.
  Only flag when the referencer's ADR number is **less than** the
  referenced ADR's number (older → newer). Newer ADRs citing older
  infrastructure ADRs (higher number → lower number) are expected to
  be one-way and are NOT flagged.
 - **G4 suggestions** — areas where an ADR seems missing based on the
  ADR corpus + SPEC reading. Phrase as suggestions, not findings. Each
  G4 item must say *why* it's suggested and remain falsifiable.
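The G3 direction rule above can be sketched as a small check. This is a minimal illustration, assuming cross-references are available as `(referencer, referenced)` ADR-number pairs; the function name `g3_gaps` and the sample numbers are hypothetical and not part of the skill itself.

```python
# Hypothetical sketch of the G3 rule: flag only older -> newer references
# that lack a back-reference. Newer -> older references are expected to be
# one-way and are never flagged.

def g3_gaps(references: set[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return the (referencer, referenced) pairs that G3 would flag."""
    gaps = []
    for src, dst in sorted(references):
        older_to_newer = src < dst
        has_back_link = (dst, src) in references
        if older_to_newer and not has_back_link:
            gaps.append((src, dst))
    return gaps


# Illustrative data only: 24 <-> 32 is mutual, 24 -> 43 has no back-link,
# and 43 -> 14 is a newer ADR citing older infrastructure.
refs = {(24, 32), (32, 24), (24, 43), (43, 14)}
print(g3_gaps(refs))
```

Only the `24 -> 43` pair is reported here, because it is the only older-to-newer citation without a back-reference.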
@@ -99,7 +103,10 @@ For each `docs/adr/ADR-NNNN-*.md`:
 - Record presence/absence of **Context** and **Decision** for G2.
  Alternatives and Consequences presence is recorded for use during
  authoring, but their absence is not a gap.
- Record ADR-NNNN cross-references for G3.
+- Record ADR-NNNN cross-references for G3, preserving the direction
  (referencer → referenced). G3 evaluation uses ADR numbers to
  distinguish older→newer (flagged when missing back-link) from
  newer→older (not flagged; see *Output Contract* G3).
 - Record Status (e.g., Accepted, Superseded, Draft) and any "supersedes
  ADR-NNNN" text in the body for G5a.
@@ -263,9 +270,11 @@ In **dry-run mode**, replace the `Wrote:` line with:
 - ADR-NNNN: missing <Context|Decision>
 - (or "none")
-**G3 — Broken cross-references**
+**G3 — Broken cross-references** (older → newer only)
- ADR-NNNN cites ADR-MMMM; ADR-MMMM does not back-reference
+- ADR-NNNN cites ADR-MMMM (NNNN < MMMM); ADR-MMMM does not back-reference
 - (or "none")
 - Note: newer ADRs citing older infrastructure ADRs (NNNN > MMMM) are
  not flagged here — one-way references are the expected pattern.
 **G4 — Suggested topics that may warrant a new ADR (verify before acting)**
 - <topic>: <why agent thinks it may be missing — must be falsifiable>
@@ -1,2 +0,0 @@
 def run(torch):
    print("IPCQ all reduce kernel bench")
@@ -1,40 +0,0 @@
 from __future__ import annotations
 import importlib
 from collections.abc import Callable
 from typing import Any
 from kernbench.runtime_api.context import RuntimeContext
 BenchFn = Callable[[RuntimeContext], Any]
 def _load_module(bench_id: str):
    bench_id = bench_id.strip()
    if not bench_id:
        raise ValueError("Bench id is empty.")
    module_path = f"benches.{bench_id}"
    try:
        return importlib.import_module(module_path)
    except ModuleNotFoundError as e:
        raise ValueError(
            f"Unknown bench '{bench_id}'. Expected module {module_path}.py"
        ) from e
 def resolve_bench(bench_id: str) -> BenchFn:
    """Resolve a bench id into its ``run(torch)`` callable.
    Expected layout (repo root):
        benches/<bench_id>.py
            def run(torch: RuntimeContext) -> Any
    """
    mod = _load_module(bench_id)
    run_fn = getattr(mod, "run", None)
    if run_fn is None:
        raise ValueError(
            f"Bench module benches.{bench_id} must define 'run(torch)'."
        )
    if not callable(run_fn):
        raise ValueError(f"'run' in benches.{bench_id} is not callable.")
    return run_fn
@@ -6,7 +6,7 @@
 defaults:
  # Algorithm to run for this benchmark execution.
-  algorithm: intercube_allreduce
+  algorithm: lrab_hierarchical_allreduce
  # IPCQ ring buffer location.
  #   tcm  — PE-local TCM (fast, small, conflicts with compute TCM access)
@@ -37,9 +37,14 @@ algorithms:
  # exchange on root cube, then broadcast back. SIP topology is read
  # from topology.yaml → system.sips.topology. Kernel auto-selects
  # ring / torus / mesh inter-SIP exchange pattern.
-  intercube_allreduce:
+  lrab_hierarchical_allreduce:
-    module: kernbench.ccl.algorithms.intercube_allreduce
+    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
    # root_cube: the kernel currently elects the root dynamically as the
    # geometric center of the cube mesh (root = (h//2)*w + (w//2)) to
    # minimize the intra-SIP critical path, so this value is NOT read today.
    # Kept as a placeholder for a future explicit-root override / runtime
    # election hook (see ADR-0032 D1 + Non-goals).
    root_cube: 15
@@ -6,10 +6,11 @@ Accepted
 ## Context
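-The `kernbench` CLI is the simulator's user-facing entry point. It exposes three
+The `kernbench` CLI is the simulator's user-facing entry point. It exposes four
 subcommands:
 - `run`: runs a benchmark against a topology.
 - `list`: prints the registered benchmarks.
 - `probe`: diagnostic utility for latency / bandwidth measurement.
 - `web`: interactive topology viewer.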
@@ -33,8 +34,9 @@ Accepted
 - `--topology <path>`: path to the topology YAML file. Loaded via
  `resolve_topology()`.
- `--bench <name>`: benchmark name, resolved via
+- `--bench <identifier>`: benchmark identifier, resolved via
-  `benches.loader.resolve_bench()`.
+  `kernbench.benches.registry.resolve()`. Accepts either a registered kebab-case name
  (e.g. `gemm-single-pe`) or the numeric index shown by `kernbench list`.
 Optional arguments:
@@ -61,7 +63,22 @@ Accepted
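 The CLI does **not** spawn multiple OS processes or independent simulation
 runs; parallelism happens inside a single simulation instance.
-### D4. `kernbench probe`: latency / bandwidth diagnostic utility
+### D4. `kernbench list`: print the registered benchmarks
 Takes no arguments. Prints each registered bench's auto-assigned index,
 registered name, and one-line description.
 Benches register themselves through the `@bench(name=..., description=...)`
 decorator (`kernbench.benches.registry`).
 Every module under `kernbench.benches/` whose name does not start with an
 underscore must register at least one bench; a missing decorator raises
 `RuntimeError` at package import time.
 Indices are assigned at import time in alphabetical order of name. The index
 is a CLI convenience for abbreviating `--bench`, not a stable API: a new
 bench that sorts earlier alphabetically shifts every later index.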
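The registration and index-shifting behavior described in D4 can be illustrated with a toy registry. This is a hedged sketch, not the `kernbench.benches.registry` implementation: the module-level `_REGISTRY`, the 0-based index, `list_benches`, and the `allreduce-demo` bench are all assumptions made for illustration.

```python
# Hypothetical sketch of an alphabetically indexed bench registry.
from typing import Callable

_REGISTRY: dict[str, tuple[str, Callable]] = {}


def bench(name: str, description: str):
    """Register the decorated function under a kebab-case name."""
    def decorator(fn: Callable) -> Callable:
        if name in _REGISTRY:
            raise ValueError(f"duplicate bench name: {name}")
        _REGISTRY[name] = (description, fn)
        return fn
    return decorator


def list_benches() -> list[tuple[int, str, str]]:
    """Return (index, name, description) rows in alphabetical order of name."""
    return [(i, n, _REGISTRY[n][0]) for i, n in enumerate(sorted(_REGISTRY))]


def resolve(identifier: str) -> Callable:
    """Accept either a registered name or a numeric index (0-based here)."""
    names = sorted(_REGISTRY)
    if identifier.isdigit():
        index = int(identifier)
        if index >= len(names):
            raise ValueError(f"bench index out of range: {index}")
        return _REGISTRY[names[index]][1]
    if identifier not in _REGISTRY:
        raise ValueError(f"unknown bench: {identifier}")
    return _REGISTRY[identifier][1]


def index_of(name: str) -> int:
    """Look up the current index of a registered name."""
    return next(i for i, n, _ in list_benches() if n == name)


@bench(name="gemm-single-pe", description="GEMM on one PE")
def gemm_single_pe(torch):
    return "gemm"


index_before = index_of("gemm-single-pe")


# A bench whose name sorts earlier shifts every later index.
@bench(name="allreduce-demo", description="hypothetical bench for illustration")
def allreduce_demo(torch):
    return "allreduce"


index_after = index_of("gemm-single-pe")
print(index_before, index_after)
```

The name stays a stable handle across registrations, while the numeric index moves, which is why the index is treated as a convenience only.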
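 ### D5. `kernbench probe`: latency / bandwidth diagnostic utility
 Required arguments:
@@ -85,7 +102,7 @@ Probe additionally verifies monotonicity invariants, for example local-H
 It is a developer tool for validating the latency / bandwidth model, not a
 benchmark.
-### D5. `kernbench web`: topology viewer
+### D6. `kernbench web`: topology viewer
 Optional arguments:
@@ -99,7 +116,7 @@ Probe additionally verifies monotonicity invariants, for example local-H
 - `kernbench web` is interactive: pan/zoom, hover for component attributes,
  and switching between the SIP / CUBE / PE views.
-### D6. The runtime API and the simulation engine stay device-scoped
+### D7. The runtime API and the simulation engine stay device-scoped
 - A runtime API call operates on one device per call.
 - The simulation engine schedules all requests deterministically.
@@ -108,6 +125,9 @@ Probe additionally verifies monotonicity invariants, for example local-H
 This invariant keeps each layer independently testable; device enumeration
 and multi-device fan-out live only in the CLI's `run` command (D3).
 The `probe` implementation lives under `kernbench.probes` (separate from
 `kernbench.benches`), reflecting that probe is a diagnostic utility, not a registered bench.
 ## Consequences
 - Benchmark authors write single-device logic; multi-device behavior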
@@ -168,6 +168,36 @@ placement = resolve_dp_policy(
 No post-hoc `pe_index` shifting: ShardSpec directly holds the structural
 `(sip, cube, pe)` coordinates. See ADR-0026 for ShardSpec details.
 ### D5. SIP grid size: explicit `sips.w/h` resolution
 The SIP grid shape (width × height) of a 2D inter-SIP topology (`torus_2d`,
 `mesh_2d_no_wrap`) is resolved from `system.sips.w` / `system.sips.h`, the
 same way D1 resolves `world_size` from `sips.count`. Precedence:
 explicit `w/h` (validated with `w*h == count`) > square fallback
 (`round(sqrt(count))²`, only when `w/h` is unspecified) > error.
 ```python
 sips = spec.get("system", {}).get("sips", {})
 if sip_topo == "ring_1d":
    w, h = 0, 0                          # 1D sentinel (no grid)
 elif sips.get("w") is not None and sips.get("h") is not None:
    w, h = int(sips["w"]), int(sips["h"])
    if w * h != n_sips:
        raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
 else:
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError("non-square sips.count requires explicit sips.w/h")
    w, h = side, side
 ```
 This removes the earlier assumption that a 2D SIP grid must be a perfect
 square: a 6-SIP `torus_2d` / `mesh_2d_no_wrap` can now be expressed as
 `w: 3, h: 2` (or `2x3`). The derived `(w, h)` is passed to the algorithm's
 inter-SIP exchange (consumed in ADR-0032 D5). The previous code path silently
 took `round(sqrt(count))²` for every non-ring topology, producing a wrong
 grid (e.g. 2×2 for 6 SIPs). The explicit `w/h` path with a fail-loud
 fallback replaces it.
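For experimentation, the precedence rule of D5 can be wrapped in a self-contained function. This is a sketch under stated assumptions: the function name `resolve_sip_grid` and the choice to read `n_sips` from `system.sips.count` inside the function are illustrative, not taken from `AhbmCCLBackend`.

```python
# Hypothetical wrapper around the D5 precedence:
# explicit w/h > square fallback > error.
import math


def resolve_sip_grid(spec: dict, sip_topo: str) -> tuple[int, int]:
    """Return (w, h) for the SIP grid, or (0, 0) as the 1D ring sentinel."""
    sips = spec.get("system", {}).get("sips", {})
    n_sips = int(sips["count"])
    if sip_topo == "ring_1d":
        return 0, 0  # 1D sentinel (no grid)
    if sips.get("w") is not None and sips.get("h") is not None:
        w, h = int(sips["w"]), int(sips["h"])
        if w * h != n_sips:
            raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
        return w, h
    # Square fallback applies only when w/h are omitted.
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError("non-square sips.count requires explicit sips.w/h")
    return side, side


rect = {"system": {"sips": {"count": 6, "w": 3, "h": 2}}}
square = {"system": {"sips": {"count": 4}}}
print(resolve_sip_grid(rect, "torus_2d"), resolve_sip_grid(square, "mesh_2d_no_wrap"))
```

A 6-SIP spec without `w/h` raises `ValueError` in this sketch, which mirrors the fail-loud behavior described above instead of silently building a 2×2 grid.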
 ---
 ## Dependencies
@@ -31,7 +31,7 @@ pe0-only same-lane inter-cube reduce**, then on the root cube, inter-SIP
 ### Current state
- `src/kernbench/ccl/algorithms/intercube_allreduce.py`: kernel
+- `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py`: kernel
 - `src/kernbench/ccl/sfr_config.py`: `configure_sfr_intercube_multisip`
 - `src/kernbench/runtime_api/distributed.py`: `AhbmCCLBackend` wires it up
  automatically at `init_process_group` time.
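@@ -42,29 +42,46 @@ pe0-only same-lane inter-cube reduce**, then on the root cube, inter-SIP
 ## Decision
-### D1. Algorithm structure: 5 phases
+### D1. Algorithm structure: 5 phases (center-root, bidirectional)
 The root cube sits at the geometric **center** of the cube mesh:
 ```
 root_col  = cube_w // 2
 root_row  = cube_h // 2
 root_cube = root_row * cube_w + root_col   # center; 10 on a 4×4 mesh
 ```
 Each reduce/broadcast phase converges toward, or diverges from, this center
 **bidirectionally**, shortening the intra-SIP critical path relative to a
 corner-root walk and approaching half of it on larger meshes (4×4 mesh:
 4 reduce hops + 4 broadcast hops vs 6+6 for an SE-corner root).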
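The hop counts quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the intra-SIP critical path equals the largest row-then-column (Manhattan) distance from any cube to the root; the helper names are hypothetical.

```python
# Hypothetical check of the center-root vs corner-root critical path.

def center_root(cube_w: int, cube_h: int) -> int:
    """Root cube id at the geometric center, as in D1."""
    return (cube_h // 2) * cube_w + (cube_w // 2)


def corner_root(cube_w: int, cube_h: int) -> int:
    """Root cube id at the SE corner (the earlier hardcoded choice)."""
    return (cube_h - 1) * cube_w + (cube_w - 1)


def worst_case_hops(cube_w: int, cube_h: int, root_cube: int) -> int:
    """Largest row-reduce + col-reduce hop count from any cube to the root."""
    root_row, root_col = divmod(root_cube, cube_w)
    return max(
        abs(row - root_row) + abs(col - root_col)
        for row in range(cube_h)
        for col in range(cube_w)
    )


for w, h in [(4, 4), (8, 8)]:
    center = worst_case_hops(w, h, center_root(w, h))
    corner = worst_case_hops(w, h, corner_root(w, h))
    print(f"{w}x{h}: center={center} hops, corner={corner} hops")
```

Under this assumption the 4×4 mesh gives 4 hops for the center root and 6 for the SE corner, and the same count applies again on the broadcast side.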
 For each SIP (launched concurrently via `mp.spawn`):
 ```
-Phase 1: row reduce W → E (cube mesh, pe0 only)
+Phase 1: row reduce converging at col == root_col (cube mesh, pe0 only)
-    col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds the row sum.
+    The left half (col < root_col) proceeds W→E and the right half
     (col > root_col) proceeds E→W; the root_col cube merges both sides → holds the row sum.
-Phase 2: col reduce N → S on the rightmost column (pe0, col = mesh_w-1)
+Phase 2: col reduce on col == root_col converging at row == root_row
-    row=0 sends S → row=1 accumulates, sends S → ... → the root cube (15)
+    The upper part (row < root_row) proceeds N→S and the lower part (row > root_row) proceeds S→N;
-    holds the full SIP sum.
+    the root cube merges both sides → holds the full SIP sum.
-Phase 3: inter-SIP exchange on the root cube (root cube's pe0 only)
+Phase 3: inter-SIP exchange at cube_id == root_cube (pe0 only)
     Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast,
     selected by sip_topo_kind (sips.topology in topology.yaml).
-Phase 4: col broadcast S → N on the rightmost column.
+Phase 4: col broadcast outward from root_row on col == root_col.
-Phase 5: row broadcast E → W across the cube mesh.
+Phase 5: row broadcast outward from root_col across the cube mesh.
 ```
 After all phases, pe0 of every cube holds the global sum.
 **Single-cube fast path**: when `cube_w == cube_h == 1` (one cube per rank, the common
 TP case), the intra-SIP reduce/broadcast phases are skipped and execution goes
 straight to the Phase 3 inter-SIP exchange.
 The kernel is a single function parameterized by `sip_topo_kind ∈ {0, 1, 2}` (ring_1d, torus_2d, mesh_2d_no_wrap).
 Phases 1-2 and 4-5 are identical across topologies;
 only phase 3 branches. The helper functions `_inter_sip_ring`, `_inter_sip_torus_2d`,
@@ -118,21 +135,24 @@ system:
 ```
 - `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapped mesh. Row ring on
+- `torus_2d`: `w × h` wrapped mesh. Row ring on `global_E/W`, then
-  `global_E/W`, then col ring on `global_S/N`.
+  col ring on `global_S/N`.
- `mesh_2d_no_wrap`: square mesh without wraparound. Per-dimension chain
+- `mesh_2d_no_wrap`: `w × h` mesh without wraparound. Per-dimension chain
   reduce + broadcast.
-The 2D variants require `n_sips` to be a perfect square.
+The 2D grid size `(w, h)` comes from `system.sips.w/h` (ADR-0024 D5).
 The square fallback (`round(sqrt(n_sips))²`) applies only when `w/h` is
 omitted, so rectangular grids (e.g. 6 SIPs as `3×2`) are supported via
 explicit `w/h`.
 ### D5. Process-group integration: `AhbmCCLBackend`
 At `init_process_group` time the backend:
 1. Loads `ccl.yaml` + `topology.yaml`.
-2. Uses the algorithm module's `TOPO_NAME_TO_KIND` to derive
+2. Derives `sip_topo_kind` from `system.sips.topology` via the algorithm
-   `sip_topo_kind, sip_topo_w, sip_topo_h` from
+   module's `TOPO_NAME_TO_KIND`, and derives `sip_topo_w, sip_topo_h` from
-   `system.sips.topology`.
+   `system.sips.w/h` with the square fallback (ADR-0024 D5).
 3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)`:
    one-time SFR wiring that mimics NCCL communicator creation.
@@ -152,17 +172,19 @@ system:
 ```yaml
 defaults:
-  algorithm: intercube_allreduce
+  algorithm: lrab_hierarchical_allreduce
  buffer_kind: tcm
  ...
 algorithms:
-  intercube_allreduce:
+  lrab_hierarchical_allreduce:
-    module: kernbench.ccl.algorithms.intercube_allreduce
+    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
-    root_cube: 15
+    root_cube: 15   # currently unused: the kernel elects the root dynamically
                    # as the geometric center (see D1). Kept as a placeholder for a
                    # future explicit-root override / runtime election hook.
 ```
 `topology.yaml`:
@@ -202,13 +224,16 @@ sip:
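 - **Per-PE allreduce** (PE-to-PE reduce within a cube). Out of scope: this
  algorithm's workload is DP per cube.
- **Asymmetric SIP topologies** (non-square mesh/torus).
+- **The square-grid fallback requires `n_sips = k²`**: rectangular SIP
-  `torus_2d` and `mesh_2d_no_wrap` require `n_sips = k²`.
+  grids (non-square mesh/torus) are supported, but only when `system.sips.w/h`
  is given explicitly (ADR-0024 D5). When `w/h` is omitted, 2D topologies
  fall back to a square grid and still require `n_sips = k²`.
 - **Pipelined chunks**: a single tile per cube, no pipelining yet.
- **Runtime election of the root cube**: the kernel currently uses the SE corner, hardcoded as
+- **Runtime election of the root cube**: to minimize the intra-SIP critical path,
-  `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)`. Since SFR
+  the kernel currently uses the geometric center,
-  wiring covers every cube, runtime election is, when needed, a pure
+  `root_cube = (mesh_h // 2) * mesh_w + (mesh_w // 2)`. Since SFR
-  kernel change.
+  wiring covers every cube, electing a different root at runtime is, when
  needed, a pure kernel change.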
 ---
@@ -241,15 +266,14 @@ sip:
 | File | Change |
 |---|---|
-| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
+| `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` (new) | kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
 | `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
 | `src/kernbench/ccl/topologies.py` | add `torus_2d`, `mesh_2d_no_wrap` |
 | `src/kernbench/ccl/install.py` | extend `_OPPOSITE_DIR` with the `global_*` pairs |
 | `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + adds sip_rank/topo arguments |
-| `ccl.yaml` | single `intercube_allreduce` entry |
+| `ccl.yaml` | single `lrab_hierarchical_allreduce` entry |
 | `topology.yaml` | add `system.sips.topology` |
 | `benches/ccl_allreduce.py` | row-wise cube-mesh tensor layout |
-| `tests/test_allreduce_multidevice.py` (new) | config-driven ring/torus/mesh |
+| `tests/sccl/` (test package) | config-driven ring/torus/mesh correctness + full `dist.all_reduce` path + latency/buffer-kind sweeps (evaluation harness, ADR-0043) |
-| `tests/test_distributed_intercube_allreduce.py` (new) | full `dist.all_reduce` path |
+| `tests/test_intercube_sfr_config.py` | SFR wiring verification |
 | `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
 | Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
@@ -0,0 +1,133 @@
 # ADR-0038: PCIE_EP Component Model
 ## Status
 Accepted (2026-05-20).
 A component-level ADR in the same vein as ADR-0035 (M_CPU), ADR-0036 (IO_CPU),
 and ADR-0037 (Forwarding).
 ## First action
 It pops one Transaction from `_inbox`, calls `run()` through `_forward_txn`,
 and inside it applies the PCIe protocol processing delay with `env.timeout()`
 for `node.attrs["overhead_ns"]`. From then on it follows the forwarding
 convention defined by the generic `ComponentBase` worker (if there is a next
 hop, `out_ports[next_hop].put(...)`; otherwise consume `drain_ns` and call
 `txn.done.succeed()`).
 In other words, **the first and only job of PCIE_EP is to express PCIe protocol overhead as time**;
 it makes no further decisions such as routing, payload transformation, or MMIO decoding.
 ## Context
 PCIE_EP acts as the **boundary point between the host and the device** in the
 topology graph. The builder (`topology/builder.py`) creates an IO chiplet
 instance per SIP, places `pcie_ep`, `io_cpu`, and `io_noc` inside it, and lays
 bidirectional edges between the external host-side cross-SIP switch and `pcie_ep`:
 - `switch → pcie_ep`: host → device traffic (MemoryWrite, MemoryRead, KernelLaunch).
 - `pcie_ep → switch`: device-side outbound (e.g. cross-SIP IPCQ tokens).
 Inside the IOChiplet, a bidirectional `pcie_ep ↔ io_noc` edge is laid, and the next hop
 branches to `io_cpu` or to the cube-side hbm_ctrl path (see the ADR-0036 IO_CPU model).
 The router and resolver already know the contract required by SPEC R7,
 "PCIE_EP is the endpoint for memory operations", so helpers such as `find_pcie_ep(sip)` and
 `find_memory_path(pcie_ep, dst_node)` start from PCIE_EP.
 The problem is that all of these dependencies live on the builder/router/resolver side, while
 **no ADR specifies PCIE_EP's own internal model**. As a result:
 - "Which latency does PCIE_EP model?" can only be answered by reading the code.
 - There is an asymmetry with other components (IO_CPU = ADR-0036, M_CPU = ADR-0035).
 - The rationale for whether to refine the PCIe link-layer model later (e.g. TLP credits, retry)
  is scattered.
 This ADR pins down the current **thin PCIE_EP model** explicitly and records that it is
 an intentional simplification (aligned with the ADR-0033 latency-model simplification policy).
 ## Decision
 ### D1. PCIE_EP uses the generic ComponentBase forwarding worker as-is
 `PcieEpComponent` inherits from `ComponentBase` and does not override `_worker`/`_forward_txn`.
 Every Transaction is therefore processed in this order:
 1. `_fan_in` enqueues the incoming message (or the Transaction reassembled from Flits) into
   `_inbox`.
 2. `_worker` pops one from `_inbox` and forks it with `env.process(self._forward_txn(env, txn))`
   (per-message pipelining).
 3. `_forward_txn` calls, in order, the op_log start hook → the `run()` delay → the op_log end
   hook.
 4. `run()` is a single line: `yield env.timeout(overhead_ns)`.
 5. If there is a next hop, `out_ports[next_hop].put(txn.advance())`; otherwise (the Transaction
   arrived at its terminal) it consumes `drain_ns` and then calls `txn.done.succeed()`.
 ### D2. `overhead_ns` is PCIE_EP's only timing model
 Only `node.attrs["overhead_ns"]` is recognized as a latency parameter. The code default is
 `0.0`, and the IOChiplet's `components.pcie_ep.attrs` in `topology.yaml` sets the actual
 value (current topology: `overhead_ns: 5.0`, in ns).
 There is no separate BW serialization resource (simpy.Resource), queue depth, or retry model.
 Link-level BW serialization is handled on the wire side: inside the IOChiplet the
 `pcie_ep_to_noc_bw_gbs = 256.0 GB/s` link applies, and outside it the system's `io_ep_to_switch`
 link BW applies (ADR-0015 port/wire model). The PCIE_EP component itself takes no part in
 this BW accounting.
 ### D3. PCIE_EP is aware of bidirectional use but does not change behavior by direction
 The topology builder lays bidirectional `switch ↔ pcie_ep` and `pcie_ep ↔ io_noc` edges.
 PCIE_EP therefore handles:
 - inbound (host→device): forwards a Transaction arriving from the switch toward io_noc,
  based on the next-hop computation.
 - outbound (device→host): forwards a Transaction arriving from io_noc/io_cpu toward the
  switch.
 Both cases are handled by the generic forwarding worker of D1; the component code itself does
 not distinguish direction (it only follows `txn.next_hop`).
 ### D4. PCIE_EP is not Flit-aware (legacy reassembly path)
 `_FLIT_AWARE` is not set to `True`. `_fan_in` therefore reassembles Flits that were chunkified
 upstream into the parent Transaction and puts it into `_inbox` (aligned with the ADR-0033
 Phase 2c gradual rollout policy).
 D4 is to be re-evaluated if PCIE_EP is extended with a PCIe TLP-level credit model.
 ### D5. PCIE_EP is a **named node** for the routing helpers
 `find_pcie_ep(sip, io_id="io0")`, `find_all_pcie_eps()`, and
 `find_memory_path(pcie_ep, dst_node)` in `policy/routing/router.py` treat PCIE_EP as the start
 (or end) of a memory path. The component itself provides no information to these helpers; the
 naming convention (`sip{S}.{io_id}.pcie_ep`) is guaranteed by the topology builder.
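 ## Alternatives Considered
 ### A1. PCIe TLP-level model (credits, retry, MPS splitting)
 Rejected. It conflicts with the simplification principle stated in ADR-0033, "the current
 latency model is expressed as abstract overhead + BW serialization". Host↔device protocol
 fidelity is intentionally out of scope under SPEC §5 "Non-Goals".
 ### A2. Limiting in-flight requests with PCIE_EP's own simpy.Resource
 Rejected. Host traffic is not a contention bottleneck in current workloads. If it becomes
 necessary, it will be introduced in a separate ADR (for compatibility, by extending D2 while leaving D1 as is).
 ### A3. Merging PCIE_EP into IO_CPU
 Rejected. PCIE_EP is the first protocol-boundary node encountered from the host side, whereas
 IO_CPU is the device-side control-plane processing node (ADR-0036). Decision costs such as
 traffic fan-out and command decoding concentrate in IO_CPU, and it is meaningful for PCIE_EP
 to express only link-edge overhead. Merging them would mix the two responsibilities and go
 against the spirit of ADR-0007 (runtime API/sim_engine boundary).
 ## Consequences
 - PCIE_EP gets an explicit model ADR with almost zero lines of code: more consistency,
  lower maintenance cost.
 - If PCIe-level refinement is needed later, a new ADR that extends D2/D4 will supersede
  this one.
 - D5 makes explicit that router helpers such as `find_memory_path` depend on PCIE_EP as a
  named node, so the impact of changing the component ID naming convention is clear.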
@@ -0,0 +1,194 @@
 # ADR-0039: PE_MMU Component Model (dual role: component + utility)
 ## Status
 Accepted (2026-05-20).
 The VA model in ADR-0011 (PA/VA/LA address model) only declares that "PE_MMU translates
 VA→PA"; this ADR pins down **the behavioral model of the PE_MMU component itself**.
 ## First action
 At construction it reads `node.attrs["page_size"]` (default `2 MiB`) and
 `node.attrs["tlb_overhead_ns"]` (default `0.0`) and instantiates the internal `PeMMU` object
 (`policy.address.pe_mmu.PeMMU`) exactly once. This object is the single owner of the page
 table, the sub-page region lists, and the TLB overhead.
 At runtime the first action splits into two paths:
 - **Component path (inbox consumption)**: `_worker` pops one Transaction from `_inbox`;
  if its `request` is an `MmuMapMsg`, it calls `self._mmu.map(va, pa, size)` for each entry
  and then `txn.done.succeed()`.
  For an `MmuUnmapMsg` it calls `unmap(va, size)`; any other type falls through to the
  standard `_forward_txn`. In other words, **the MMU's first job is to apply map/unmap commands to the page table**.
 - **Utility path (direct call)**: engines inside the same PE, such as PE_DMA / PE_GEMM,
  call `pe_mmu.mmu.translate(va)` directly. No SimPy event occurs on this path;
  the caller (when overhead_ns > 0) handles `yield env.timeout(mmu.overhead_ns)` in its
  own process.
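 ## Context
 ADR-0011 defined the three address models PA/VA/LA and agreed only that "the VA model =
 translation through PE_MMU". In the code, however, `PeMmuComponent` plays two complementary
 roles at once:
 1. **A component in the topology graph**: it receives `MmuMapMsg` / `MmuUnmapMsg`
   sideband messages from the cube NoC and updates the page table.
 2. **A PE-local utility object**: PE_DMA / PE_GEMM in the same PE call `translate(va)`
   directly at zero latency (or with only `overhead_ns` borne on the caller side).
 No ADR covers both roles, which leaves the following ambiguities:
 - "Why does MMU translation not show up as a SimPy event?" (It is accounted for on the caller side.)
 - What is the sub-page region model, and why that model? (It is in the code docstring but
  has no ADR; only a reference to a memory note named `project_mmu_subpage_stopgap` exists.)
 - **From whom** do map/unmap messages arrive, and **by when** must the update take effect
  (ordering contract)?
 In addition, `PeMMU.map()` has "later append, last-write-wins (reverse search)" semantics.
 This is a **stopgap** added intentionally for DPPolicy sub-page sharding scenarios
 (e.g. 128B payloads × 4KB pages) that a simple single-PA page table model cannot express.
 The ADR needs to state that this is a simplification that differs from a real HW MMU.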
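 ## Decision
 ### D1. Making the dual role explicit: component and utility
 `PeMmuComponent` exposes two interfaces within a single class:
 - Component interface: `_inbox` consumption and the `_worker` loop (handling MMU sideband messages).
 - Utility interface: the `pe_mmu.mmu` attribute exposes the underlying `PeMMU` object;
  PE_DMA / PE_GEMM hold this object directly and call `translate()`.
 The latter is **not a layer skip**: the inside of a PE is a sibling relationship within the
 single "components" layer defined by ADR-0007, and a direct call on the PE_MMU object
 obtained from the same PE prefix is not cross-layer. A cross-layer violation applies only
 when crossing the runtime API / sim_engine / components boundaries.
 ### D2. Latency model: `translate()` is a pure function; overhead is the caller's responsibility
 `PeMMU.translate()` is a pure function and does not yield to SimPy. After translation, the
 caller (a PE engine) issues `if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`
 in its own process.
 Rationale: the PE engine's SimPy process already holds its own record_start / record_end
 (op_log) hooks, so it can capture timing consistently. If the MMU created a separate process,
 it would split the PE engine's processing flow in two and blur the meaning of op_log/pipeline overlap.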
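 #### D2.1. Asymmetry in the current implementation: pipeline vs non-pipeline (Known asymmetry)
 At the time of writing, the `pe_dma.py` implementation handles overhead differently on its two call paths:
 - **non-pipeline (`handle_command`)**: right after `translate()` it issues
  `if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`.
 - **pipeline (`_do_pipeline_dma`)**: it calls only `translate()` and **omits** the overhead
  timeout. The function comment says "same logic as non-pipeline path", but the two do not
  actually match.
 With the default topology, `tlb_overhead_ns = 0.0`, so the difference does not show up
 directly in timing. In simulations that set `tlb_overhead_ns > 0`, however, GEMM/Math on the
 pipeline path is measured as faster than the same non-pipeline workload by the MMU overhead.
 The D2 contract is that "**every** caller is responsible for the overhead"; the omission on
 the pipeline path is **an implementation inconsistency, not an intended design**. Nothing in
 ADR-0014 D6 (pipeline self-routing) states that it exempts this overhead.
 Options (a separate Phase 1/2 proposal is required):
 - (a) Add `if mmu.overhead_ns > 0: yield env.timeout(...)` to `_do_pipeline_dma` as well,
  matching the D2 contract. Recommended.
 - (b) Narrow the D2 contract to "applies to the non-pipeline path only" and justify the
  pipeline path's exemption together with an update to ADR-0014 D6. Not recommended, because it weakens the meaning of overhead.
 This ADR recommends (a) and assumes it will be fixed in a separate small change before or
 right after acceptance.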
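 ### D3. Page table structure: sub-page region list (stopgap)
 With the structure `self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]`,
 a single page can hold multiple disjoint regions.
 - `map(va, pa, size)`: when the mapping crosses pages, it **appends** the resulting regions.
 - `translate(va)`: fetches the region list by VPN, then walks it in **reverse** and
  takes the first matching region (last-write-wins).
 - `unmap(va, size)`: removes only regions whose extent is **fully contained** in the unmap range.
  Partial overlaps with misaligned boundaries are left in place; the mapping caller is
  responsible for unmapping with the same boundaries as the mapping.
 The ADR records this as a **simulator stopgap** that differs from a real HW MMU, as a
 supplement to the ADR-0011 VA model. Its purpose is to prevent silent misrouting caused by
 last-write-wins overwrites during DPPolicy sub-page sharding (memory note: project_mmu_subpage_stopgap).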
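The region-list semantics of D3 can be illustrated with a toy page table. This is a hedged sketch, not the `policy.address.pe_mmu.PeMMU` source: the class name `ToyPeMMU`, the 4 KiB page size, and the per-page splitting inside `map` are assumptions chosen to match the description above.

```python
# Hypothetical sketch of a sub-page region list with last-write-wins lookup.

class PageFault(Exception):
    """Raised when no region covers the requested VA."""


class ToyPeMMU:
    def __init__(self, page_size: int = 4096):
        self.page_size = page_size
        # vpn -> list of (start_in_page, end_in_page, pa_at_offset_zero)
        self._table: dict[int, list[tuple[int, int, int]]] = {}

    def map(self, va: int, pa: int, size: int) -> None:
        """Append one region per page touched by [va, va + size)."""
        end = va + size
        while va < end:
            vpn, start = divmod(va, self.page_size)
            chunk = min(end - va, self.page_size - start)
            # Store the PA that in-page offset 0 would translate to.
            self._table.setdefault(vpn, []).append((start, start + chunk, pa - start))
            va += chunk
            pa += chunk

    def translate(self, va: int) -> int:
        """Walk regions newest-first; the first match wins."""
        vpn, offset = divmod(va, self.page_size)
        for start, end, pa_zero in reversed(self._table.get(vpn, [])):
            if start <= offset < end:
                return pa_zero + offset
        raise PageFault(hex(va))

    def unmap(self, va: int, size: int) -> None:
        """Remove only regions fully contained in [va, va + size)."""
        end = va + size
        for vpn in range(va // self.page_size, (end - 1) // self.page_size + 1):
            base = vpn * self.page_size
            kept = [
                r for r in self._table.get(vpn, [])
                if not (va <= base + r[0] and base + r[1] <= end)
            ]
            if kept:
                self._table[vpn] = kept
            else:
                self._table.pop(vpn, None)


mmu = ToyPeMMU()
mmu.map(0x1000, 0x8000, 128)   # first 128B shard of the page
mmu.map(0x1080, 0x9000, 128)   # second shard, different PA
mmu.map(0x1000, 0xA000, 128)   # remap of the first shard: newest entry wins
print(hex(mmu.translate(0x1004)), hex(mmu.translate(0x1084)))
```

In this sketch two shards of the same page resolve to different physical addresses, and a later `map` over the same range shadows the earlier one without deleting it.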
 ### D4. PageFault is a PA fallback signal
 Calling `translate()` with an unmapped VA raises `PageFault`. PE_DMA catches this
 exception and **uses the original address as a PA as-is** (the PA fallback compatibility
 path of ADR-0011). PageFault is therefore not an error but a signal meaning "interpret the address as a PA when no VA mapping exists".
 This compatibility path is intended behavior that preserves backward compatibility with
 the PA-only mode agreed in ADR-0011.
 ### D5. Receive contract for MMU sideband messages
 `MmuMapMsg` / `MmuUnmapMsg` reach the PE_MMU component's `_inbox` through the fabric
 (per R10: "MMU map installation follows fabric latency"). The message
 schema is defined by the runtime API (`runtime_api/kernel.py`); the current format is:
 - `MmuMapMsg.entries: tuple[dict, ...]`: each dict has the keys `{"va": int, "pa": int,
  "size": int}`.
 - `MmuUnmapMsg.entries: tuple[dict, ...]`: each dict has the keys `{"va": int, "size": int}`.
 Receive handling on the PE_MMU side:
 1. `_worker` pops one message from `_inbox.get()`.
 2. It checks whether the message is a Transaction wrapper via `hasattr(msg, "request")`.
 3. If `isinstance(msg.request, MmuMapMsg)`, it calls
   `self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"])` for each entry.
 4. If `isinstance(msg.request, MmuUnmapMsg)`, it calls
   `self._mmu.unmap(va=e["va"], size=e["size"])` for each entry.
 5. Both signal completion with `msg.done.succeed()`.
 When the external caller (runtime API side) awaits `done`, the "point at which the mapping
 is installed on the device" is guaranteed in SimPy time; this wait realizes ADR-0011's
 requirement that "MMU map installation incurs measured fabric latency".
 This ADR does not define the **sender and fan-out policy** for sideband messages;
 that is the runtime API's responsibility. This ADR specifies only the PE_MMU-side receive contract.
 ### D6. Non-MMU Transactions are delegated to generic forwarding
 If the `request` of a message popped from the inbox by `_worker` is neither `MmuMapMsg` nor
 `MmuUnmapMsg` (or the message has no `request` attribute), it falls through to `_forward_txn`.
 This keeps open the possibility of PE_MMU being used as a pass-through node on the
 cube-internal NoC in the future (no such traffic exists today, but this is safe against topology changes).
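 ## Alternatives Considered
 ### A1. Making translate() a SimPy generator
 Rejected. As explained in D2, it would blur the meaning of the PE engine's op_log/pipeline overlap.
 The current pattern, where the caller raises the timeout, matches op_log accounting.
 ### A2. Shrinking the page size itself (e.g. 128B) instead of using sub-page region lists
 Rejected. It would blow up page table memory and the size of cube-wide map messages.
 Even if DPPolicy sharding requires 128B, most other mappings are in 2MiB units,
 so a small page size inflates the average cost.
 ### A3. Keeping PE_MMU only as a built-in helper of PE_CPU rather than a component
 Rejected. Expressing the "MMU map installation with latency measured through the fabric"
 required by ADR-0011 (the MmuMapMsg path) requires a node in the topology graph. PE_MMU must
 also appear as a node in the cube NoC visualizer for debugging and diagnosis to stay consistent.
 ## Consequences
 - The dual role of PE_MMU (component + utility) is justified at the ADR level, providing an
  argument against future refactoring pressure ("unify into one of the two").
 - The ADR states that the sub-page region model is a simulator stopgap; this becomes the
  baseline for evaluating whether the stopgap can be removed when the LA model (ADR-0011) is introduced.
 - The contract that `translate()` does not yield is fixed by the ADR, so a future proposal
  to "put a timeout inside the MMU" can be declined on the basis of D2.
 - PA fallback (D4) is stated to be the normal flow, which prevents anyone from mistaking
  PageFault for an error and adding defensive logic.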
@@ -0,0 +1,142 @@
 # ADR-0040: PE_TCM Component Model (dual-channel BW serialization)
 ## Status
 Accepted (2026-05-20).
 ADR-0014 (PE Pipeline Execution Model) mentions that "PE_TCM is a BW-based serializing
 scratchpad memory" (D1); this ADR separately specifies the exact behavioral model of the
 TCM component itself.
 ## First action
 As soon as `start()` is called, it creates two `simpy.Resource(env, capacity=1)` objects and
 stores them in `self._read_res` / `self._write_res`. These two resources are the single
 decision point that serializes the **read channel** and the **write channel**, each at 1 in flight.
 First runtime action: `_worker` pops one message from `_inbox` and branches on its type:
 - `TcmRequest` (from `pe_fetch_store`): forks via `env.process(self._handle_tcm_request)`.
  In other words, **the TCM's first job is to take the channel lock matching the direction (read/write)**.
  After acquiring the lock, if `bw > 0 and nbytes > 0`, it waits `delay_ns = nbytes / bw`
  with `env.timeout`, then calls `req.done.succeed()`.
 - Anything else (Transaction): forks via `env.process(self._forward_txn)` (legacy fabric
  pass-through path).
 At construction it reads and stores `node.attrs["read_bw_gbs"]` / `node.attrs["write_bw_gbs"]`
 (default `512.0 GB/s` each).
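 ## Context
 In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:
 1. **`TcmRequest` from PE_FETCH_STORE → PE_TCM**: for TCM ↔ Register File transfers,
   PE_FETCH_STORE sends a short sideband request to obtain the access latency serialized
   by the TCM's BW (`direction = "read"` or `"write"`, `nbytes`,
   `done` event).
 2. **Legacy Transaction forwarding**: a generic forwarding path kept in case the TCM ends up
   as a pass-through node in the topology graph (not used on the current critical path, but
   preserved).
 The problem: ADR-0014 only mentions that "PE_TCM is BW-based serialization". The code,
 however, explicitly implements the following:
 - **Reads and writes are separate channels and can proceed concurrently**, while requests
  in the same direction are serialized at cap=1.
 - BW can be set separately through two values, `read_bw_gbs` / `write_bw_gbs`.
 - The formula `delay_ns = nbytes / bw_gbs` (unit shorthand: GB/s × ns ≈ B).
 - If nbytes==0, the BW term is skipped but the channel lock is still taken.
 - `run()` yields for `overhead_ns` (default 0.0), but this is used only on the legacy fabric
  path (Transaction forwarding).
 All of this needs to be pinned down in a separate ADR. In particular, "why are read/write
 separate channels" and "who decides the BW" are items that need a clear rationale when
 someone later tries to change them, for example to capacity=2.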
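 ## Decision
 ### D1. Dual channel: read and write are independent resources
 `_read_res = simpy.Resource(env, capacity=1)`,
 `_write_res = simpy.Resource(env, capacity=1)`.
 Concurrent requests in the same direction are serialized in the resource queue, while requests in different directions can proceed concurrently.
 This matches a model in which the TCM in real HW operates as dual-port (read port + write port),
 and the separation is intentional so that the BW-serialization model can express the normal
 case where fetch (read) and store (write) overlap in time in the GEMM pipeline.
 ### D2. Per-channel BW model: `nbytes / bw_gbs`
 After acquiring the channel lock, if `nbytes > 0 and bw > 0`, `yield env.timeout(nbytes / bw_gbs)`.
 The unit shorthand is GB/s × ns ≈ B, the same BW formula used throughout the simulator
 (see ADR-0033: the simulator uses a consistent shorthand unit).
 - `nbytes == 0`: the BW term is 0, but the lock is taken and released immediately. This case
  is intentional: even when a plan generator that emits an empty fetch/store sends a request
  from the PE_FETCH_STORE side with only `nbytes` set to 0, the TCM-side op_log / channel
  accounting is consumed once, consistently.
 - `bw == 0` (config mistake): the timeout call itself is skipped, so it is a 0-time pass.
  This does not occur in a normal setup.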
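The channel arithmetic of D1 and D2 can be reproduced without SimPy. This is a minimal sketch under stated assumptions: `ToyTcmChannels` and `tcm_delay_ns` are hypothetical names, requests are served first-come-first-served per channel, and a plain "free at" timestamp stands in for `simpy.Resource(env, capacity=1)`.

```python
# Hypothetical model of dual-channel BW serialization (timing only).

def tcm_delay_ns(nbytes: int, bw_gbs: float) -> float:
    """delay_ns = nbytes / bw_gbs, using the shorthand GB/s x ns = B."""
    if nbytes > 0 and bw_gbs > 0:
        return nbytes / bw_gbs
    return 0.0  # empty request or bw == 0: zero-time pass


class ToyTcmChannels:
    def __init__(self, read_bw_gbs: float = 512.0, write_bw_gbs: float = 512.0):
        self._bw = {"read": read_bw_gbs, "write": write_bw_gbs}
        # Time at which each channel's single slot becomes free again.
        self._free_at = {"read": 0.0, "write": 0.0}

    def request(self, now_ns: float, direction: str, nbytes: int) -> float:
        """Return the completion time of a request issued at now_ns."""
        if direction not in self._bw:
            raise ValueError(f"direction must be 'read' or 'write': {direction!r}")
        start = max(now_ns, self._free_at[direction])  # wait for the channel
        done = start + tcm_delay_ns(nbytes, self._bw[direction])
        self._free_at[direction] = done
        return done


tcm = ToyTcmChannels()
first_read = tcm.request(0.0, "read", 4096)    # 4096 B / 512 GB/s = 8 ns
second_read = tcm.request(0.0, "read", 4096)   # queued behind the first read
write = tcm.request(0.0, "write", 4096)        # independent channel
print(first_read, second_read, write)
```

Two reads issued at the same instant finish back to back, while a write issued at that instant overlaps with the first read, which is the fetch/store overlap D1 is meant to preserve.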
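 ### D3. BW is configured through `read_bw_gbs` / `write_bw_gbs` in `node.attrs`
 The default is `512.0 GB/s`. The topology builder (`topology/builder.py`) passes these
 attrs when it instantiates the TCM from `pe_template`. Changing the defaults must go together
 with a decision on the ADR-0014 D1 or ADR-0033 latency-model side.
 ### D4. PE_TCM owns the TcmRequest schema
 `@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
 is defined in `components/builtin/pe_tcm.py`. PE_FETCH_STORE only imports this dataclass
 to create and send it. The caller side does not define the schema because:
 - The meaning of BW serialization is the TCM's responsibility: the TCM decides which fields
  are used for serialization decisions.
 - Validating that the `direction` string is one of `"read"` / `"write"` is also handled on
  the TCM side (the if/else branch in `_handle_tcm_request`).
 ### D5. Preserving the legacy Transaction forwarding path
 When `_worker` receives a message that is not a `TcmRequest`, it sends it to `_forward_txn`,
 where the `overhead_ns` of `run()` applies. In the current standard PE pipeline the TCM is
 not picked up as a pass-through node for Transactions, but the path is preserved for future
 fabric topology changes (orthogonal to the usage pattern of D1).
 This path is recorded in op_log as ordinary Transaction accounting and takes no BW channel lock.
 ### D6. PE_TCM is not a data store of its own (timing only)
 The TCM models **time only**. The actual data payload is held by sim_engine's separate
 `memory_store` (if any), and the TCM component does not update it.
 PE_FETCH_STORE likewise obtains only the BW delay through TcmRequest and handles the actual
 register contents through a separate path (ADR-0020 2-pass data execution model; data is
 processed in Phase 2).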
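 ## Alternatives Considered
 ### A1. Single channel (read+write shared at capacity=2)
 Rejected. It would artificially serialize the normal case where fetch (read) and store
 (write) overlap in time, mis-modeling the PE pipeline's BW upper bound.
 ### A2. Channel capacity > 1 (e.g. 2-banked TCM)
 Rejected. The current HW model assumes a single bank. Extending to multi-bank requires a
 separate ADR, which would then supersede D1. Raising capacity at this stage would leave the
 BW upper bound unchanged while only loosening the nominal serialization, lowering actual model accuracy.
 ### A3. Generalizing the BW formula to `nbytes / bw + overhead_ns`
 Rejected. `overhead_ns` is used only on the legacy forwarding path of D5. If extra overhead
 becomes necessary on the fetch/store critical path, it is more appropriate, in terms of
 responsibility boundaries, to place it not in the TCM but in PE_FETCH_STORE's `run()` or in
 the register-file access model.
 ## Consequences
 - With TCM BW accounting fixed at the ADR level, questions that come up when interpreting
  op_log for GEMM/Math sweeps, such as "why did fetch and store proceed concurrently" or
  "why is only the same direction serialized", are quickly resolved by D1.
 - The impact of a future multi-bank TCM or a change to an asymmetric read/write BW model
  becomes clear (which of D1, D2, D3 is modified).
 - Stating that the TCM is not a data store (D6) firms up the responsibility boundary with
  ADR-0020 2-pass execution.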
@@ -0,0 +1,187 @@
 # ADR-0041: Cube SRAM Component Model — terminal scratchpad on cube NoC
 ## Status
 Accepted (2026-05-20).
 ADR-0017 (Cube NOC and HBM Connectivity) mentions the SRAM only as an
 attachment on the cube NoC. This ADR fills that gap by specifying the
 latency/response model of the SRAM component itself.
 ## First action
 Right after `_worker` pulls one Transaction from `_inbox`, the first thing it
 does is call `yield from self.run(env, txn.nbytes)`, which issues an
 `env.timeout()` of `node.attrs["overhead_ns"]` (default `0.0`).
 In other words, **the SRAM's first job is to express the access overhead as
 time**.
 After consuming the overhead it yields `drain_ns` (the terminal BW
 serialization cost assigned to that Transaction), and then builds a
 `ResponseMsg` and sends it along the reverse path.
 This differs from the generic `ComponentBase._worker`: because the SRAM knows
 it is a **terminal node**, it does not go through `_forward_txn`. Its own
 worker spells out the order `run → drain → _send_response`.
 ## Context
 The cube topology (`topology/builder.py`) creates the following named nodes
 for each cube:
 - `sip{S}.cube{C}.m_cpu`
 - `sip{S}.cube{C}.sram`
 - `sip{S}.cube{C}.hbm_ctrl` (one partition per PE)
 - `sip{S}.cube{C}.pe{P}` (the PE's internal sub-components)
 The SRAM is one of the attachments on the cube NoC and is attached to the
 nearest router (`topology/mesh_gen.py` picks the nearest router from the
 placement coordinates and adds the SRAM to its `attach`). The builder lays
 bidirectional `sram ↔ router` edges (BW: `sram_to_router_bw_gbs`, default
 `128.0 GB/s`).
 The SRAM has two key roles:
 1. **Fabric terminal**: the endpoint of memory-access Transactions routed to
   the SRAM over the cube NoC. The SRAM consumes the access overhead and the
   drain, then returns a response along the reverse path.
 2. **One of the IPCQ slot tiers**: of the `buffer_kind ∈ {tcm, sram, hbm}`
   tiers defined in ADR-0023 D9.7, the slot bw/overhead of the `sram` tier is
   read from `common/ipcq_types._BUFFER_KIND_BW`, currently
   `(512.0 GB/s, 2.0 ns)`. This value is separate from `overhead_ns` in the
   SRAM node attrs; PE_DMA converts it to time when it does IPCQ slot
   accounting.
 Both roles are served by a single SRAM component, and without a dedicated ADR:
 - The answer to "which latency does the SRAM model: fabric drain + overhead,
  or the slot latency of the IPCQ tier?" is scattered.
 - It is unclear what the SRAM size attr (`size_mb`) will actually mean in the
  future. The current code does not use the size and models timing only.
 - The rationale for which router the SRAM attaches to in a cube
  (placement-based) lives only in the topology code.
 ## Decision
 ### D1. The SRAM is a terminal scratchpad node on the cube NoC
 `SramComponent` inherits from `ComponentBase` but overrides `_worker` to
 express the terminal semantics directly:
 ```
 while True:
    txn = yield self._inbox.get()
    yield from self.run(env, txn.nbytes)     # overhead_ns
    if drain_ns > 0: yield env.timeout(drain_ns)
    yield from self._send_response(env, txn)
 ```
 Because the SRAM has to know the reverse path, this pattern needs its own
 worker rather than the generic `_forward_txn` (forward to the next hop).
 #### D1.1. Currently unused: the `_worker` override is a dormant path
 In the codebase at the time of writing, **no component actually sends a
 Transaction to an SRAM node**. The places that reference the SRAM node ID are:
 - `policy/routing/router.py` and other routing helpers: they only guarantee
  that a path can be looked up.
 - `components/builtin/pe_dma.py::_handle_ipcq_inbound`: when an IPCQ slot has
  `buffer_kind == "sram"`, it looks up only the *path* to
  `bank_node = f"{cube_prefix}.sram"`, converts it with
  `compute_drain_ns(path, ...)`, and **times out locally**. The Transaction
  itself never flows to the SRAM node (see D4).
 - `tests/test_routing.py`: verifies connectivity only, via
  `find_path("sip0.cube0.pe0", "sip0.cube0.sram")`.
 The `_worker`/`_send_response` override is therefore a **dormant code path**.
 It is kept rather than deleted because:
 - It is immediately usable once a topology change makes the SRAM the endpoint
  of real fabric Transactions (e.g. an explicit M_CPU → SRAM access).
 - Endpoint behavior is semantically natural for the cube-attached scratchpad
  defined in ADR-0017 (Cube NOC), so this is an intentional placeholder.
 A separate ADR (or a later revision of this ADR) will state when this dormant
 state ends.
 ### D2. ResponseMsg creation and reverse-path send
 `_send_response`:
 1. Computes the reverse path as `reverse_path = list(reversed(txn.path))`.
 2. Creates `ResponseMsg(correlation_id=txn.request.correlation_id, request_id=...,
   src_cube=<this cube>, src_pe=-1, success=True)`.
 3. Wraps it in `Transaction(request=resp_msg, path=reverse_path, step=0, nbytes=0,
   done=env.event(), is_response=True)` and puts it on
   `out_ports[reverse_path[1]]`.
 4. If the reverse path is malformed (`< 2 hops`) or there is no ctx, falls
   back to calling only the original `txn.done.succeed()`.
 `src_pe = -1` means "the SRAM is not PE-localized". `src_cube` is filled by
 parsing the cube index from the node ID (`sip{S}.cube{C}.sram`).
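The reverse-path and `src_cube` handling described in D2 can be sketched without SimPy. This is an illustrative stand-in, not the component's code: `ResponseSketch`, `cube_index`, `build_response`, and the router node name in the usage are hypothetical, and the real `ResponseMsg` / `Transaction` carry more fields than shown here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResponseSketch:
    """Hypothetical stand-in for the ResponseMsg + reverse-path Transaction pair."""
    src_cube: int
    src_pe: int
    path: List[str]


def cube_index(node_id: str) -> int:
    """Parse C out of a node ID shaped like 'sip{S}.cube{C}.sram'."""
    for part in node_id.split("."):
        if part.startswith("cube"):
            return int(part[len("cube"):])
    raise ValueError(f"no cube segment in {node_id!r}")


def build_response(path: List[str]) -> Optional[ResponseSketch]:
    """Return the response, or None when the caller should use the fallback."""
    reverse_path = list(reversed(path))
    if len(reverse_path) < 2:
        # Malformed path: the fallback only completes the original txn.done.
        return None
    # The last hop of the forward path is the SRAM node itself.
    return ResponseSketch(src_cube=cube_index(path[-1]), src_pe=-1, path=reverse_path)


if __name__ == "__main__":
    fwd = ["sip0.cube3.pe0", "sip0.cube3.router1", "sip0.cube3.sram"]
    resp = build_response(fwd)
    print(resp.path[1], resp.src_cube, resp.src_pe)
```

The first hop of the response is `reverse_path[1]`, which is the router the SRAM is attached to, so the component never needs to know its attachment point ahead of time.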
 ### D3. Timing parameters are split into `overhead_ns` and wire-side `drain_ns`
 - **Component-side latency**: `node.attrs["overhead_ns"]`. The default
  topology sets it to about `2.0 ns`.
 - **Link-side serialization**: `drain_ns` is the value the Transaction
  carries on arrival, the result of wire-side BW serialization in ADR-0015
  (port/wire model). The SRAM simply yields it as is.
 - The `size_mb` attr (default `32 MiB`) is not currently used for timing. It
  will be given meaning if a capacity-aware model is introduced (in a
  separate ADR).
 ### D4. The SRAM component does not model IPCQ slot accounting itself
 For the SRAM-tier write latency of an IPCQ slot under ADR-0023 D9.7, PE_DMA's
 `_handle_ipcq_inbound` calls `slot_io_latency_ns("sram", nbytes)` directly and
 consumes the time (that function uses the value in
 `common/ipcq_types._BUFFER_KIND_BW["sram"]`). That is:
 - When the SRAM component receives and handles a fabric Transaction, only
  **D1, D2, and D3** apply.
 - When an IPCQ slot lives in SRAM, PE_DMA pays the time separately at IPCQ
  slot-write time. This is independent of the SRAM component code and is
  IPCQ-side accounting.
 This separation is intentional: IPCQ is a fast path (sub-cycle slot
 bookkeeping) that does not go through fabric Transactions, so the SRAM has no
 need to be aware of IPCQ.
 ### D5. The SRAM's attachment point on the cube NoC is placement-driven
 `topology/mesh_gen.py` reads `placement.sram.pos_mm` (default `[1.5, 9.0]` in
 `topology.yaml`) and adds `"sram"` to the `attach` of the nearest router. The
 builder (the attachment loop in `topology/builder.py`) reads that attach
 information and lays bidirectional edges between the `sram` node and the
 router.
 This decision lives outside the SRAM component code (in mesh_gen / builder),
 and the component does not need to know which router it is attached to. It
 only needs to know that `txn.path` / `reverse_path` reach it through a router.
 ### D6. The SRAM is not a data store (timing-only)
 In the same spirit as ADR-0040 D6: the SRAM component models time only, and
 the actual data payload is held by the sim_engine's `memory_store` (when
 present).
 ## Alternatives Considered
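 ### A1. SRAM uses `_forward_txn` as is, with a separate response node like IO_CPU / HBM_CTRL
 Rejected. The SRAM is a terminal on the cube NoC; adding a separate node to
 receive the response adds meaningless hops and goes against the cube NoC
 simplification intent of ADR-0017.
 ### A2. SRAM models BW serialization as its own resource
 Rejected. Link-side BW serialization (`drain_ns`) already captures this
 adequately. Adding another `simpy.Resource` inside the component would double
 count against the ADR-0015 wire-side model.
 ### A3. SRAM handles IPCQ slot accounting on the component side
 Rejected. As stated in D4, IPCQ is a fast path and does not pass through
 fabric Transactions. If the SRAM became IPCQ-aware, responsibility would
 split in two and become harder to reason about.
 ### A4. Capacity-aware latency model driven by `size_mb`
 Rejected (at this stage). Capacity is used for little more than labeling in
 the topology visualizer, and its effect on timing is not modeled yet. It will
 be introduced in a separate ADR if needed.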
 ## Consequences
 - Because the SRAM timing model is fixed at ADR level as
  `overhead_ns + drain_ns + ResponseMsg(reverse_path)`, D4 can be cited to
  reject any attempt to add IPCQ slot latency to the SRAM component.
 - Stating that `size_mb` is currently timing-neutral (D3) keeps the
  compatibility impact narrow when a capacity-aware model is introduced later.
 - Stating the placement-driven router attachment (D5) makes it clear which
  part is affected when the SRAM coordinates move (`mesh_gen` only).
@@ -0,0 +1,194 @@
 # ADR-0042: Tile Plan Generators (plan builders for the GEMM/Math pipelines)
 ## Status
 Accepted (2026-05-20).
 This ADR states that `tiling.py` is not a SimPy component but a
 **plan-generator module**.
 D6 (tile plan / self-routing) of ADR-0014 (PE Pipeline Execution Model) does
 not define the tile-plan generation algorithm itself, so this ADR fills that
 gap.
 ## First action
 When `generate_gemm_plan(M, K, N, tile_m, tile_k, tile_n, ..., pe_prefix, a_pinned,
 b_pinned, epilogue_specs)` is called, the first thing it does is **compute
 the tile counts and build the component ID strings**:
 ```
 M_tiles = max(1, ceil(M / tile_m))
 K_tiles = max(1, ceil(K / tile_k))
 N_tiles = max(1, ceil(N / tile_n))
 dma_id   = f"{pe_prefix}.pe_dma"
 fetch_id = f"{pe_prefix}.pe_fetch_store"
 gemm_id  = f"{pe_prefix}.pe_gemm"
 math_id  = f"{pe_prefix}.pe_math"
 ```
 In other words, **the plan generator's first job is to compute the tile
 counts with a ceiling and build this PE's four sub-component IDs in one go**.
 It never touches SimPy events or environment objects: the module consists of
 pure functions.
 `generate_math_plan(M, N, tile_m, tile_n, ..., math_op, src_addr, dst_addr,
 pe_prefix)` likewise starts by computing `M_tiles` and `N_tiles` and building
 three component IDs (`dma_id`, `fetch_id`, `math_id`).
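The ceiling-based tile count above can be checked with a small standalone function. This is a sketch, not the module's code: `tile_counts` is a hypothetical name, and the `32/64/32` tile sizes used in the example are the `TILE_M/K/N` values quoted in ADR-0044 D4.

```python
from math import ceil
from typing import Tuple


def tile_counts(M: int, K: int, N: int,
                tile_m: int, tile_k: int, tile_n: int) -> Tuple[int, int, int]:
    """Hypothetical helper mirroring the M_tiles / K_tiles / N_tiles formulas."""
    # max(1, ...) keeps at least one tile per dimension.
    return (
        max(1, ceil(M / tile_m)),
        max(1, ceil(K / tile_k)),
        max(1, ceil(N / tile_n)),
    )


if __name__ == "__main__":
    m_t, k_t, n_t = tile_counts(512, 512, 512, 32, 64, 32)
    print(m_t, k_t, n_t, m_t * k_t * n_t)
```

For the 512-cubed shape this gives 16 × 8 × 16 tiles, which agrees with the 2048-tile figure that ADR-0044 quotes for that shape.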
 ## Context
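 ADR-0014 D6 only agreed that "when PE_SCHEDULER receives a CompositeCmd it
 generates a TilePlan and feeds self-routing tile tokens". In the code,
 however, the **concrete plan-generation algorithm** lives in the
 `src/kernbench/components/builtin/tiling.py` module, and this module:
 - Is not a component but a collection of **pure functions**
  (`generate_gemm_plan`, `generate_math_plan`).
 - Does not depend on the SimPy environment, queues, op_log, hooks, and so on.
 - Returns a `PipelinePlan` (dataclass) as its result.
 The earlier G4 analysis wrongly assumed that `tiling.py` was a component; in
 fact it provides the plan-builder functions injected into PE_SCHEDULER. This
 difference has to be pinned down in a separate ADR that pairs with D6 of
 ADR-0014. Otherwise:
 - It is ambiguous whether PE_SCHEDULER or a separate module is responsible
  for building tile plans.
 - The rationale for whether the GEMM plan and Math plan stage sequences are
  consistent (e.g. the position of FETCH/STORE) is scattered.
 - There is no stated reason why options such as `a_pinned` / `b_pinned` /
  `epilogue_specs` branch at plan level.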
 ## Decision
 ### D1. tiling is a pure plan-generator module, not a component
 `components/builtin/tiling.py` defines no ComponentBase subclass. It exposes
 only two module-level functions:
 - `generate_gemm_plan(...) -> PipelinePlan`
 - `generate_math_plan(...) -> PipelinePlan`
 No node named `tiling` exists in the topology graph. The module lives in the
 `builtin/` directory because it is a direct helper of PE_SCHEDULER (ADR-0014
 D6); semantically it is closer to an internal utility of PE_SCHEDULER.
 ### D2. GEMM plan stage sequence: `M → N → K` order
 Stage sequence for each (m, n, k) tile (default case, with no operand pinning
 and no epilogue):
 ```
 [DMA_READ(A)] → [DMA_READ(B)] → FETCH → GEMM
                                ↑
                                ↓
 (last k tile only)              [MATH(output_tile)]* → STORE → DMA_WRITE
 ```
 A `k_tile` epilogue runs right after GEMM on every K tile; an `output_tile`
 epilogue runs once per (m, n), on the last K tile, right before
 STORE/DMA_WRITE. The K-loop accumulator stays in the RegFile, so no
 STORE/DMA_WRITE occurs between K tiles (output happens only at last_k).
 ### D3. Operand pinning: `a_pinned` / `b_pinned`
 When the caller passes `a_pinned=True`, **the A DMA_READ is omitted for every
 (m, n, k) tile**. Meaning: it signals to the plan generator that the caller
 (e.g. `tl.composite`) has already loaded all of A into the TCM once with
 `tl.load`.
 This branch is decided at plan time (it is not a runtime branch). The number
 of stage records in op_log therefore changes deterministically with pinning,
 and sweep analysis (e.g. the stage record count in gemm_sweep) sees this
 decision directly.
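To make the effect of D2 and D3 on the stage record count concrete, the following sketch enumerates stage names in `M → N → K` order. It is an approximation under stated assumptions: `gemm_stage_names` and the stage labels are hypothetical, `b_pinned` is assumed to skip the B read symmetrically to `a_pinned`, and epilogues are reduced to plain counts.

```python
from typing import List


def gemm_stage_names(M_tiles: int, N_tiles: int, K_tiles: int,
                     a_pinned: bool = False, b_pinned: bool = False,
                     n_k_tile_ops: int = 0, n_output_ops: int = 0) -> List[str]:
    """Hypothetical sketch: flat list of stage labels for a GEMM plan."""
    stages: List[str] = []
    for _m in range(M_tiles):
        for _n in range(N_tiles):
            for k in range(K_tiles):
                if not a_pinned:
                    stages.append("DMA_READ_A")
                if not b_pinned:  # assumption: symmetric to a_pinned
                    stages.append("DMA_READ_B")
                stages += ["FETCH", "GEMM"]
                stages += ["MATH_K_TILE"] * n_k_tile_ops
                if k == K_tiles - 1:
                    # Output leaves the RegFile only on the last K tile.
                    stages += ["MATH_OUTPUT_TILE"] * n_output_ops
                    stages += ["STORE", "DMA_WRITE"]
    return stages


if __name__ == "__main__":
    print(len(gemm_stage_names(1, 1, 2)))
    print(len(gemm_stage_names(1, 1, 2, a_pinned=True)))
```

Because the pinning flags only change which labels are appended, the difference in record count is fixed at plan time, which is the property the sweep analysis relies on.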
 ### D4. Epilogue scope: `k_tile` vs `output_tile`
 `epilogue_specs` is an iterable of op-spec objects. Each op object is assumed
 to have the following attributes:
 - `op.kind: str`: the math op name (e.g. `"dequant"`, `"bias"`, `"relu"`,
  `"scale"`). It goes into the stage's `params["op_kind"]`.
 - `op.scope: Scope`: `Scope.K_TILE` or `Scope.OUTPUT_TILE` (`Scope` is an
  enum defined in `kernbench.common.pe_commands`).
 - Additional per-op fields (e.g. `bias`, `scale`, `factor`): the plan
  generator does not currently use them; the runtime (PE_MATH) consumes them.
 The plan generator splits ops into two groups based on
 `getattr(o, "scope", None)`:
 - `scope == Scope.K_TILE`: add a MATH stage right after GEMM on every K tile.
 - `scope == Scope.OUTPUT_TILE`: add a MATH stage right before STORE on the
  last K tile of each (m, n).
 An op with no `scope` attribute, or with a value that is neither enum member,
 is **not included in the plan** (`getattr(..., None) == Scope.X` is False for
 both). Applying the default (`output_tile`) is the **caller's responsibility**
 (e.g. `tl.composite`); the plan generator only branches on a scope value that
 has already been filled in (aligned with the composite epilogue contract of
 ADR-0014).
 `Scope` is imported lazily inside the function to avoid the circular
 reference `pe_commands ← pe_types ← tiling`. This is an intentional pattern,
 not something to clean up (from the D1 view that "tiling is a utility of
 PE_SCHEDULER", having no import-time dependency on pe_commands keeps the
 module boundary clean).
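The scope-based grouping in D4 can be sketched as follows. The `Scope` enum here is a local stand-in for the one in `kernbench.common.pe_commands` (its member values are an assumption), and `split_epilogues` is a hypothetical name for the grouping step.

```python
from enum import Enum
from types import SimpleNamespace
from typing import Iterable, List, Tuple


class Scope(Enum):
    """Local stand-in; the real enum lives in kernbench.common.pe_commands."""
    K_TILE = "k_tile"
    OUTPUT_TILE = "output_tile"


def split_epilogues(epilogue_specs: Iterable[object]) -> Tuple[List[object], List[object]]:
    """Group ops by scope; ops without a recognized scope are dropped."""
    specs = list(epilogue_specs)
    k_tile = [o for o in specs if getattr(o, "scope", None) == Scope.K_TILE]
    output_tile = [o for o in specs if getattr(o, "scope", None) == Scope.OUTPUT_TILE]
    return k_tile, output_tile


if __name__ == "__main__":
    ops = [
        SimpleNamespace(kind="dequant", scope=Scope.K_TILE),
        SimpleNamespace(kind="relu", scope=Scope.OUTPUT_TILE),
        SimpleNamespace(kind="bias"),  # no scope attribute: silently excluded
    ]
    k_ops, out_ops = split_epilogues(ops)
    print([o.kind for o in k_ops], [o.kind for o in out_ops])
```

The `bias` op in the usage shows why the default has to be applied by the caller: an op that reaches the generator without a scope simply never appears in the plan.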
 ### D5. Math plan stage sequence: `M → N` order
 Stage sequence for each (m, n) tile:
 ```
 DMA_READ → FETCH → MATH → STORE → DMA_WRITE
 ```
 There is no K dimension, so concepts such as epilogue or accumulator
 residency do not apply. The register-file accounting of PE_FETCH_STORE is
 handled the same way as for the GEMM plan.
 ### D6. A plan is data, with no SimPy dependency
 `PipelinePlan` is a dataclass defined in `pe_types.py` that holds
 `tiles: list[TilePlan]`. Each `TilePlan` holds `stages: tuple[Stage, ...]`.
 The plan itself is a near-immutable data structure (only `params: dict` on
 Stage is mutable) and holds no SimPy objects or events.
 At runtime PE_SCHEDULER looks at the first stage of the plan, creates a
 `TileToken`, and feeds it into the pipeline. The TileToken carries
 `plan: TilePlan`, `stage_idx: int`, and `params: dict`. Self-routing proceeds
 by `TileToken.advance()` caching the `params` of the next stage (ADR-0014 D6).
 ### D7. Plan generator contract: pure, deterministic, idempotent
 Calling it twice with the same input returns the same PipelinePlan
 (deterministic down to the order of `TilePlan.stages`). This contract is
 aligned with the "deterministic tile dispatch order" requirement of ADR-0014
 D6.
 There are no side effects (SimPy events, file I/O, global state), so tests
 can call it without an environment object (some cases in
 `tests/test_pe_pipeline.py` do this).
 ## Alternatives Considered
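 ### A1. Make tiling a component (e.g., PE_PLANNER)
 Rejected. Plan generation is a decision algorithm that consumes no SimPy
 time. As a component it would (a) drag in unnecessary infrastructure such as
 an inbox and resources, and (b) force PE_SCHEDULER to split "receive plan"
 and "feed tiles" into two steps, creating a meaningless hop.
 ### A2. Move plan generation into PE_SCHEDULER class methods
 Rejected (for now). Keeping a separate module gives (1) easier testing and
 (2) extensibility: introducing another plan algorithm (e.g. a DTensor-aware
 plan) only requires defining an additional function. If plan kinds multiply
 and explicit dispatch becomes necessary, a plan factory in PE_SCHEDULER will
 be introduced through a separate ADR.
 ### A3. Force plans to be immutable (frozen dataclass + tuple)
 Partially adopted. `Stage` and `TilePlan` are dataclasses but not frozen.
 Reason: `Stage.params: dict` is filled at plan-generation time and only read
 at runtime (the TileToken merely caches it on advance). Going fully frozen
 brings little benefit relative to the cost of a dict → frozendict migration.
 The convention remains that nothing mutates a plan outside the plan stage.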
 ## Consequences
 - Stating at ADR level that `tiling.py` is a plan-generator module rather
  than a component prevents future G4-style findings of "this component has
  no ADR".
 - Because the GEMM plan stage sequence (D2) and the pinning/epilogue branches
  (D3, D4) are fixed by ADR, the basis for interpreting stage record counts in
  sweep analysis (`scripts/gemm_sweep.py`) becomes clear.
 - Thanks to the plan generator's pure contract (D7), tests can validate plans
  without an environment, in line with the "behavior validated by tests with
  meaningful input cases" intent of ADR-0013 (verification strategy).
 - When new plan kinds such as a DTensor-aware plan or a K-major plan are
  added, this ADR serves as the baseline: add only a new function and follow
  D1, D6, and D7.
@@ -0,0 +1,126 @@
 # ADR-0043: Allreduce evaluation harness (`tests/sccl/`)
 ## Status
 Accepted
 Documents the `tests/sccl/` evaluation harness; verified against the
 implementation (constants, file set, and sweep dimensions cross-checked).
 ## Context
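 ADR-0032 defines the intercube all-reduce *algorithm*, and ADR-0023/0024/0027
 define the IPCQ backend, the rank=SIP launcher, and `mp.spawn`. None of them,
 however, describes **how the allreduce is driven and characterized**: the
 correctness tests, the latency/buffer-kind sweeps, and the derived plots.
 Where ADR-0013 (verification strategy) is the general policy, this ADR fixes
 the concrete allreduce harness so that the "evaluation" half of the work is
 documented alongside the implementation.
 The harness lives in `tests/sccl/` (the package created when the allreduce
 tests were consolidated). It replaces the earlier flat layout of
 `tests/test_allreduce_multidevice.py` + `tests/test_distributed_*`.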
 ## Decision
 ### D1. Drive the evaluation through the public `torch.distributed` path
 Correctness tests and sweeps run the collective through the real DDP-shaped
 path,
 `init_process_group(backend="ahbm") → mp.spawn → dist.all_reduce`
 (ADR-0024/0027), and do not use the lower-level `ctx.launch`.
 The shared helper
 `_run_distributed(tmp_path, monkeypatch, topo_path, corr_id, n_elem)` in
 `tests/sccl/_allreduce_helpers.py` builds the engine, runs the workers, and
 returns `(engine, n_cubes)`. `monkeypatch.chdir` points the backend's
 `load_ccl_config()` (a cwd lookup) at a per-case temporary `ccl.yaml`.
 The direct-launch reference (`run_allreduce`) stays in the same helper
 module. It is not there for the distributed tests; the IPCQ buffer-kind /
 root-center micro tests under `tests/` import it.
 ### D2. One file per evaluation concern
 | File | Concern | `torch.distributed`? |
 |---|---|---|
 | `test_allreduce_ring_torus_mesh.py` | ring_1d / torus_2d (2×3) / mesh_2d_no_wrap (2×3) correctness | yes |
 | `test_distributed_default_topology.py` | Full path with `topology.yaml` as is | yes |
 | `test_plot_latency_sweep.py` | Latency sweep rows (n_elem × topology) | yes |
 | `test_plot_buffer_kind_sweep.py` | TCM/SRAM/HBM sweep rows | yes |
 | `test_plot_topology_diagram.py` | topology.png (pure matplotlib) | no |
 | `test_plot_comparison_fsim.py` | Broken-axis model vs FSIM comparison | no |
 | `test_intercube_root_center.py` | ADR-0032 center-root latency guard (direct path) | no |
 `_allreduce_helpers.py` holds the shared plumbing (driver, config writer,
 sweep/buffer-kind constants, plot aggregators, topology-diagram + FSIM
 comparison emitters). It is not collected (no `test_` prefix).
 ### D3. Latency metric: critical-path `pe_exec_ns`
 The latency reported per config is `crit_ns = max(pe_exec_ns)` over
 `engine._results`, that is, the PE execution time of the slowest rank.
 This is the value drawn on every latency chart and written to `summary.csv`.
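The critical-path metric reduces to a single `max` over per-rank results. The sketch below assumes each result is a mapping with a `pe_exec_ns` key; the actual shape of `engine._results` is not specified here, and `critical_path_ns` is a hypothetical helper name.

```python
from typing import Iterable, Mapping


def critical_path_ns(results: Iterable[Mapping[str, float]]) -> float:
    """Hypothetical helper: latency of the slowest rank, as in crit_ns."""
    values = [r["pe_exec_ns"] for r in results]
    if not values:
        raise ValueError("no results to aggregate")
    return max(values)


if __name__ == "__main__":
    fake_results = [
        {"pe_exec_ns": 120.0},
        {"pe_exec_ns": 340.5},
        {"pe_exec_ns": 95.0},
    ]
    print(critical_path_ns(fake_results))
```

Taking the maximum rather than the mean reflects that a collective is only finished once its slowest participant is.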
 ### D4. Sweep dimensions
 - **Latency sweep**: `n_elem ∈ {8, 32, 64, 128, 512, 1024, 2048, 4096,
  8192, 16384, 32768, 49152}` (16 is excluded because it collides with
  `n_cubes`) × topology ∈
  {ring_1d (6), torus_2d 2×3 (6), mesh_2d_no_wrap 2×3 (6)}.
 - **Buffer-kind sweep**: `buffer_kind ∈ {tcm, sram, hbm}` × a smaller
  `n_elem` grid, on the 6-SIP torus_2d (3×2). buffer_kind is set in the
  temporary `ccl.yaml` and applied from there (the backend reads it at
  `init_process_group` time, ADR-0023 D6).
 The 2×3 / 3×2 grids exercise the explicit `w/h` SIP resolution (ADR-0024 D5).
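For a sense of the size of the latency sweep, the grid in D4 can be enumerated directly. The constant and function names below are hypothetical (the real constants live in `_allreduce_helpers.py` under names not shown here); the values are the ones listed above.

```python
from itertools import product
from typing import List, Tuple

# Values copied from the D4 latency-sweep description; names are illustrative.
N_ELEM = (8, 32, 64, 128, 512, 1024, 2048, 4096, 8192, 16384, 32768, 49152)
TOPOLOGIES = ("ring_1d", "torus_2d", "mesh_2d_no_wrap")


def latency_sweep_cases() -> List[Tuple[int, str]]:
    """Cartesian product of element counts and inter-SIP topologies."""
    return list(product(N_ELEM, TOPOLOGIES))


if __name__ == "__main__":
    cases = latency_sweep_cases()
    print(len(cases), cases[0], cases[-1])
```

Each tuple would correspond to one parametrized test case, and so to one staged JSON row.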
 ### D5. Derived plots through `pytest_sessionfinish` aggregators
 The sweep tests are xdist-friendly: each parametrized case writes one JSON
 row to a staging directory. The conftest `pytest_sessionfinish` hook
 (controller node only) calls the aggregators in `_allreduce_helpers.py`:
 - `_aggregate_sweep_plots()` → per-topology PNGs + `summary.csv`
 - `aggregate_buffer_kind_plot()` → TCM/SRAM/HBM comparison PNG + csv
 The topology-diagram and FSIM-comparison figures are emitted directly by
 their own `test_plot_*` tests (no row staging; they are pure functions of
 `topology.yaml` and `summary.csv` respectively). All outputs land in
 `docs/diagrams/allreduce_latency_plots/` and are **derived artifacts** per
 CLAUDE.md (consistent with the ADRs, no Phase-2 gate).
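 ### D6. The FSIM comparison reference is a hardcoded constant
 `emit_comparison_fsim_plot()` overlays the model curve with a single external
 FSIM single-device reference (`366 µs`), which is held as a literal; there is
 no external data file. The "measured" series comes from the simulator
 (`op_log` GEMM count, `composite_window_ns`), and the "theoretical" series
 comes from a hand-derived analytical model (the same model that ADR-0044 D5
 marks as not yet verified against ADRs).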
 ## Consequences
 ### Positive
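 - The allreduce is evaluated through the same API as a real DDP script, so
  the harness doubles as an integration test for ADR-0024/0027.
 - Figures are regenerated from committed data on every `pytest` run; there is
  no manual plotting step.
 - The rectangular-grid sweeps provided the regression coverage that surfaced
  the ADR-0024 D5 `w/h` fix.
 ### Negative / limitations
 - The full latency sweep runs in the default `pytest` (on the order of
  minutes); it is not marked `slow`. (Contrast with ADR-0044, which marks the
  GEMM sweep `slow`.)
 - `test_intercube_root_center.py` holds a latency *threshold* assertion (the
  ADR-0032 center-root guard). It is the only absolute-latency assertion in
  the suite and is sensitive to latency model changes (ADR-0033).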
 ## Dependencies
 - **ADR-0013**: verification strategy (the general policy that this ADR
  specializes).
 - **ADR-0023 / ADR-0024 / ADR-0027**: IPCQ backend, rank=SIP launcher,
  `mp.spawn`; the path that D1 drives.
 - **ADR-0032**: the algorithm under evaluation; the D4 grids exercise its
  topology branches.
 - **ADR-0044**: the sibling GEMM evaluation harness.
 ## Open questions
 - Should the latency sweep be marked `slow` for consistency with the GEMM
  sweep?
 - Should the FSIM reference move from a hardcoded constant to a versioned
  data file?
@@ -0,0 +1,127 @@
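 # ADR-0044: GEMM evaluation harness (`scripts/gemm_sweep.py` + `tests/gemm/`)
 ## Status
 Accepted
 Documents the GEMM evaluation/characterization harness; verified against the
 implementation (constants, tile sizes, figure set, and script↔test split
 cross-checked). The D5/D6 caveats are recorded limitations, not inaccuracies.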
 ## Context
 ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generator) define the GEMM
 *implementation*, and ADR-0033 defines the latency model. None of them,
 however, describes **how GEMM performance is swept and characterized**: the
 shape/variant sweep that produces timing data and the figures that interpret
 it. This ADR fixes that harness.
 Unlike the allreduce harness (ADR-0043), the GEMM sweep is **heavy** (24 sim
 runs: 8 shapes × 3 operand-staging variants; the `512` shape alone is 2048
 tiles). That weight drives the split below.
 ## Decision
 ### D1. Two-tier split: heavy data generation (script) vs. fast figures (test)
 - **Data generation stays a manual script**: `scripts/gemm_sweep.py` runs
  `matmul-composite` (the ADR-0042 plan) across shape × variant through the
  same `run_bench` path as the CLI, harvests `result.engine.op_log`, and
  writes `docs/diagrams/gemm_sweep.json` (per-stage/per-engine wall-clock +
  occupancy + record count + pe/composite window).
 - **Figure rendering is test-generated**: `tests/gemm/` reads the committed
  `gemm_sweep.json` and renders matplotlib PNGs into
  `docs/diagrams/gemm_plots/`. These tests are fast and run by default.
 Rationale: a slide-deck-scale sim sweep does not belong in every `pytest`
 run, but the figures (cheap and deterministic) should be freely regenerated
 and guarded by CI. This mirrors the script-vs-test split in CLAUDE.md
 (heavy/manual generation goes in scripts; fast assertions go in tests).
 ### D2. A slow regenerator test wraps the script
 `tests/gemm/test_gemm_sweep.py` is marked `@pytest.mark.slow` (excluded by
 the default `addopts: -m "not slow"`). It calls `scripts/gemm_sweep.py` as a
 subprocess to regenerate `gemm_sweep.json` on demand
 (`pytest -m slow tests/gemm/test_gemm_sweep.py`). The sweep logic has a
 single home (the script) and the test only wraps it, so the sim-driving code
 is not duplicated.
 ### D3. Figure set (3 charts, `load_ref` variant)
 | Test | PNG | Content |
 |---|---|---|
 | `test_plot_gemm_stage_breakdown.py` | `gemm_stage_breakdown.png` | Per-stage engine wall-clock (DMA in / Fetch / GEMM / DMA out) |
 | `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_measured.png` | GEMM util % + useful eff % |
 | `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_theoretical_vs_measured.png` | Theoretical vs simulator-measured util/eff |
 `tests/gemm/_gemm_plot_helpers.py` holds the shared renderers (the series
 logic mirrors the GEMM `_render_*` functions in
 `scripts/build_overview_slides.py`, which still draws natively into the
 PPTX). It is not collected (no `test_` prefix). Each `test_plot_*` skips when
 `gemm_sweep.json` is missing.
 ### D4. Tile sizes are data-driven; under-tile shapes are marked
 Tile sizes are read from `gemm_sweep.json` (`tile_sizes`), which the sweep
 records from `PeSchedulerComponent.TILE_M/K/N = 32/64/32`, the authoritative
 source. Shapes with `M<TILE_M ∨ K<TILE_K ∨ N<TILE_N` are marked
 ("under-tile") on the charts. The `512³` shape is excluded from the figures
 (`EXCLUDED_SHAPES`).
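The under-tile condition in D4 is a simple disjunction over the three dimensions. A minimal sketch, assuming the `32/64/32` tile sizes quoted above as defaults; `is_under_tile` is a hypothetical name and the real helper reads the sizes from `gemm_sweep.json`.

```python
def is_under_tile(M: int, K: int, N: int,
                  tile_m: int = 32, tile_k: int = 64, tile_n: int = 32) -> bool:
    """True when any dimension is smaller than its tile size."""
    return M < tile_m or K < tile_k or N < tile_n


if __name__ == "__main__":
    # A 32x32x32 shape is under-tile because K (32) is below TILE_K (64).
    print(is_under_tile(32, 32, 32), is_under_tile(64, 64, 64))
```

Note that a cubic shape can be under-tile in only one dimension, since the K tile is twice as large as the M and N tiles.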
 ### D5. Theoretical model: inherited constants, not yet verified against ADRs
 The "theoretical" curves use an analytical ideal-pipeline model with
 constants copied verbatim from `scripts/build_overview_slides.py`:
 ```
 HBM_GBS = 256.0   # GB/s        T_STAGE = 16.0 ns
 D_STAGES = 3                    BPE = 2
 ```
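 **These values have not yet been sourced against any ADR.** In particular,
 the `256` in ADR-0033 is `burst_bytes` (256 B), a *different* quantity from
 this `256 GB/s`, and ADR-0033 derives bandwidth as
 `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`. `T_STAGE` and the stage count
 have not been traced to ADR-0014 here either. The model is therefore
 **consistent only with the existing deck script, not verified against the
 ADRs**, and the constants are duplicated (deck + helper). Reconciling this
 (sourcing from topology/ADR-0033/0014 and removing the duplication) is
 deferred; see Open questions.
 ### D6. Known naming caveat: the `_measured` chart
 `gemm_mac_utilization_measured.png` currently plots *theoretical*
 ideal-pipeline numbers (the footnote says so). Only the file name says
 "measured". This is a known misnomer, pending a decision on whether to
 repoint its content to the simulator-measured series or to rename it.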
 ## Consequences
 ### Positive
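 - GEMM figures are test-generated and CI-guarded, like the allreduce ones.
 - The heavy sweep stays opt-in, so the default test run remains fast.
 - The slow test reuses the single source of sweep logic (the script).
 ### Negative / limitations
 - The theoretical model constants (D5) are unverified and duplicated.
 - The `_measured` figure is a misnomer (D6).
 - `build_overview_slides.py` still does not embed these PNGs and instead
  draws the GEMM bars natively from `gemm_sweep.json`; rewiring the deck to
  consume the test artifacts is not done yet.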
 ## Dependencies
 - **ADR-0013**: verification strategy.
 - **ADR-0014 / ADR-0042**: PE pipeline + tile-plan generator, the GEMM
  implementation that the sweep measures; the stage record counts in D4 come
  from ADR-0042 D2/D3.
 - **ADR-0033**: latency model, the source that the D5 constants should be
  (but are not yet) traced to.
 - **ADR-0043**: the sibling allreduce evaluation harness.
 ## Open questions
 - Should the D5 constants be reconciled against `topology.yaml` / ADR-0033 /
  ADR-0014 and deduplicated (a single source for model parameters)?
 - How should the D6 `_measured` naming be resolved (repoint the content vs.
  rename)?
 - Should `build_overview_slides.py` be rewired to embed the `gemm_plots/`
  PNGs instead of drawing bars natively?
@@ -7,9 +7,10 @@ Accepted
 ## Context
 The `kernbench` CLI is the user-facing entry point of the simulator. It
-exposes three subcommands:
+exposes four subcommands:
 - `run` — execute a benchmark against a topology.
 - `list` — enumerate registered benches.
 - `probe` — diagnostic utility for latency / BW measurement.
 - `web` — interactive topology viewer.
@@ -33,8 +34,10 @@ Required arguments:
 - `--topology <path>`: topology YAML file path. Loaded via
  `resolve_topology()`.
- `--bench <name>`: benchmark name. Resolved via
+- `--bench <identifier>`: benchmark identifier. Resolved via
-  `benches.loader.resolve_bench()`.
+  `kernbench.benches.registry.resolve()`, which accepts either the
  registered kebab-case name (e.g., `gemm-single-pe`) or a numeric
  index from `kernbench list`.
 Optional arguments:
@@ -63,7 +66,21 @@ When `--device all` (or omitted) and the topology has multiple SIPs:
 The CLI does NOT spawn multiple OS processes or independent
 simulation runs — parallelism is internal to one simulation instance.
-### D4. `kernbench probe` — latency / BW diagnostic utility
+### D4. `kernbench list` — enumerate registered benches
 No arguments. Prints each registered bench's auto-assigned index,
 registered name, and one-line description.
 Benches register themselves via the `@bench(name=..., description=...)`
 decorator (`kernbench.benches.registry`). Every non-underscore module
 under `kernbench.benches/` MUST register at least one bench; a missing
 decorator raises `RuntimeError` at package import time.
 Indices are assigned alphabetically by name at import time. They are a
 CLI convenience (shorthand for `--bench`), not a stable API — a new
 bench inserted alphabetically will shift later indices.
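The index-shift caveat can be illustrated with a few lines of Python. This is a sketch under assumptions, not the registry's code: `assign_indices` is a hypothetical helper, the 1-based start is a guess, and `ccl-allreduce` is a made-up bench name used only to show the shift.

```python
from typing import Dict, Iterable


def assign_indices(names: Iterable[str], start: int = 1) -> Dict[str, int]:
    """Hypothetical helper: number bench names in alphabetical order."""
    return {name: i for i, name in enumerate(sorted(names), start=start)}


if __name__ == "__main__":
    before = assign_indices(["gemm-single-pe", "matmul-composite"])
    after = assign_indices(["gemm-single-pe", "matmul-composite", "ccl-allreduce"])
    # Adding a name that sorts earlier pushes every later index up by one.
    print(before["gemm-single-pe"], after["gemm-single-pe"])
```

Scripts that need a stable reference would therefore pass the kebab-case name to `--bench` rather than an index.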
 ### D5. `kernbench probe` — latency / BW diagnostic utility
 Required argument:
@@ -87,7 +104,7 @@ that local-HBM access ≤ cross-PE-within-cube ≤ cross-cube ≤
 cross-SIP — and reports violations. Probe is a developer tool for
 verifying the latency / BW model; it is not a benchmark.
-### D5. `kernbench web` — topology viewer
+### D6. `kernbench web` — topology viewer
 Optional arguments:
@@ -102,7 +119,7 @@ the browser. Distinct from the static `docs/diagrams/` artifacts:
 - `kernbench web` is interactive — pan/zoom, hover for component
  attributes, switch between SIP / CUBE / PE views.
-### D6. Runtime API and simulation engine remain device-scoped
+### D7. Runtime API and simulation engine remain device-scoped
 - Runtime API calls operate on one device per invocation.
 - The simulation engine schedules all requests deterministically.
@@ -112,6 +129,10 @@ This invariant keeps each layer testable in isolation; device
 enumeration and multi-device fan-out live only in the CLI's `run`
 command (D3).
 The `probe` implementation lives under `kernbench.probes` (separate
 from `kernbench.benches`), reflecting that probes are diagnostic
 utilities, not registered benches.
 ## Consequences
 - Benchmark authors write single-device logic; multi-device behavior
@@ -173,6 +173,37 @@ placement = resolve_dp_policy(
 No post-hoc `pe_index` shifting — ShardSpec carries the `(sip, cube, pe)`
 structural coordinates directly. ShardSpec details in ADR-0026.
 ### D5. SIP grid dimensions — explicit `sips.w/h` resolution
 For 2D inter-SIP topologies (`torus_2d`, `mesh_2d_no_wrap`) the SIP grid
 shape (width × height) is resolved from `system.sips.w` / `system.sips.h`,
 mirroring how D1 resolves `world_size` from `sips.count`. Precedence:
 explicit `w/h` (validated `w*h == count`) > square fallback
 (`round(sqrt(count))²`, used only when no `w/h` is given) > error.
 ```python
 sips = spec.get("system", {}).get("sips", {})
 if sip_topo == "ring_1d":
    w, h = 0, 0                          # 1D sentinel (no grid)
 elif sips.get("w") is not None and sips.get("h") is not None:
    w, h = int(sips["w"]), int(sips["h"])
    if w * h != n_sips:
        raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
 else:
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError("non-square sips.count requires explicit sips.w/h")
    w, h = side, side
 ```
 This lifts the earlier assumption that 2D SIP grids must be perfect
 squares: a 6-SIP `torus_2d` / `mesh_2d_no_wrap` is now expressible as
 `w: 3, h: 2` (or `2x3`). The derived `(w, h)` feed the algorithm's
 inter-SIP exchange (consumed in ADR-0032 D5). The prior code path silently
 took `round(sqrt(count))²` for any non-ring topology, which produced a
 wrong grid (e.g. 2×2 for 6 SIPs); the explicit-`w/h` path with a
 fail-loud fallback replaces that.
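The precedence above can be exercised in isolation by wrapping the same logic in a function. This is a sketch for experimentation: `resolve_sip_grid` is a hypothetical name and signature, and the backend's actual code is the fragment shown in D5.

```python
import math
from typing import Tuple


def resolve_sip_grid(spec: dict, sip_topo: str, n_sips: int) -> Tuple[int, int]:
    """Hypothetical wrapper: explicit w/h > square fallback > error."""
    sips = spec.get("system", {}).get("sips", {})
    if sip_topo == "ring_1d":
        return 0, 0  # 1D sentinel (no grid)
    if sips.get("w") is not None and sips.get("h") is not None:
        w, h = int(sips["w"]), int(sips["h"])
        if w * h != n_sips:
            raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
        return w, h
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError("non-square sips.count requires explicit sips.w/h")
    return side, side


if __name__ == "__main__":
    rect = {"system": {"sips": {"count": 6, "w": 3, "h": 2}}}
    print(resolve_sip_grid(rect, "torus_2d", 6))
    print(resolve_sip_grid({"system": {"sips": {"count": 4}}}, "mesh_2d_no_wrap", 4))
```

With `w/h` omitted, a 6-SIP 2D topology raises instead of silently producing a 2×2 grid, which is the fail-loud behavior described above.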
 ---
 ## Dependencies
@@ -32,7 +32,7 @@ bandwidth characteristics for the common per-cube DP workload.
 ### Current state
- `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
+- `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` — kernel
 - `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
 - `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
  automatically at `init_process_group` time.
@@ -43,29 +43,46 @@ bandwidth characteristics for the common per-cube DP workload.
 ## Decision
-### D1. Algorithm structure — 5 phases
+### D1. Algorithm structure — 5 phases (center-root, bidirectional)
 The root cube sits at the geometric **center** of the cube mesh:
 ```
 root_col  = cube_w // 2
 root_row  = cube_h // 2
 root_cube = root_row * cube_w + root_col   # center; 10 on a 4×4 mesh
 ```
 Each reduce/broadcast phase converges/diverges **bidirectionally** toward
 this center, halving the intra-SIP critical path versus a corner-root walk
 (4×4 mesh: 4 hops reduce + 4 hops broadcast vs 6+6 with an SE-corner root).
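The hop counts quoted above can be reproduced with a short calculation. This is an illustrative sketch: `center_root` and `reduce_hops` are hypothetical helpers, and the hop model simply takes the longest walk to the root column plus the longest walk to the root row.

```python
def center_root(cube_w: int, cube_h: int) -> int:
    """Cube ID of the geometric center, following the formulas in D1."""
    root_col = cube_w // 2
    root_row = cube_h // 2
    return root_row * cube_w + root_col


def reduce_hops(cube_w: int, cube_h: int, root_col: int, root_row: int) -> int:
    """Critical-path hops for row reduce + col reduce toward a given root."""
    # Both halves walk concurrently, so only the longer side counts.
    row_hops = max(root_col, cube_w - 1 - root_col)
    col_hops = max(root_row, cube_h - 1 - root_row)
    return row_hops + col_hops


if __name__ == "__main__":
    print(center_root(4, 4))
    print(reduce_hops(4, 4, 2, 2), reduce_hops(4, 4, 3, 3))
```

On a 4×4 mesh the center root at (2, 2) needs 4 reduce hops while the SE corner at (3, 3) needs 6, and the broadcast phases mirror the same counts.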
 For each SIP (launched concurrently by `mp.spawn`):
 ```
-Phase 1 — Row reduce W → E (cube mesh, pe0 only):
+Phase 1 — Row reduce converging at col == root_col (cube mesh, pe0 only):
-    col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
+    left half (col < root_col) walks W→E; right half (col > root_col)
    walks E→W; the root_col cube merges both sides → holds row sum.
-Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
+Phase 2 — Col reduce on col == root_col converging at row == root_row:
-    row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
+    above (row < root_row) walks N→S; below (row > root_row) walks S→N;
-    holds the full SIP sum.
+    the root cube merges both → holds the full SIP sum.
-Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
+Phase 3 — Inter-SIP exchange on cube_id == root_cube (pe0 only):
    Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
    selected by sip_topo_kind (from topology.yaml sips.topology).
-Phase 4 — Col broadcast S → N on rightmost column.
+Phase 4 — Col broadcast on col == root_col, outward from root_row.
-Phase 5 — Row broadcast E → W across the cube mesh.
+Phase 5 — Row broadcast outward from root_col across the cube mesh.
 ```
 After all phases every cube's pe0 holds the global sum.
 **Single-cube fast-path**: when `cube_w == cube_h == 1` (one cube per rank,
 the common TP case), the intra-SIP reduce/broadcast phases are skipped and
 the kernel goes straight to the Phase 3 inter-SIP exchange.
 The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
 (ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
 across topologies; only phase 3 branches. Helper functions
@@ -121,20 +138,24 @@ system:
 ```
 - `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
+- `torus_2d`: `w × h` wrapping mesh. Row ring on `global_E/W` then col
-  `global_E/W` then col ring on `global_S/N`.
+  ring on `global_S/N`.
- `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
+- `mesh_2d_no_wrap`: `w × h` mesh without wrap-around. Chain reduce +
  broadcast per dimension.
-2D variants require `n_sips` to be a perfect square.
+2D grid dims `(w, h)` come from `system.sips.w/h` (ADR-0024 D5). A square
 fallback (`round(sqrt(n_sips))²`) applies **only** when `w/h` are omitted,
 so rectangular grids (e.g. 6 SIPs as `3×2`) are supported by giving
 explicit `w/h`.
 ### D5. Process-group integration — `AhbmCCLBackend`
 At `init_process_group` time the backend:
 1. Loads `ccl.yaml` + `topology.yaml`.
-2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
+2. Derives `sip_topo_kind` from `system.sips.topology` via the algorithm
-   `system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
+   module's `TOPO_NAME_TO_KIND`, and `sip_topo_w, sip_topo_h` from
   `system.sips.w/h` with a square-only fallback (ADR-0024 D5).
 3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
   SFR wiring, mirrors NCCL communicator creation.
@@ -154,17 +175,19 @@ At each `dist.all_reduce(tensor)` call:
 ```yaml
 defaults:
-  algorithm: intercube_allreduce
+  algorithm: lrab_hierarchical_allreduce
  buffer_kind: tcm
  ...
 algorithms:
-  intercube_allreduce:
+  lrab_hierarchical_allreduce:
-    module: kernbench.ccl.algorithms.intercube_allreduce
+    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
-    root_cube: 15
+    root_cube: 15   # NOT read today — the kernel elects the root dynamically
                    # as the geometric center (see D1). Kept as a placeholder
                    # for a future explicit-root override / runtime election.
 ```
 `topology.yaml`:
@@ -203,13 +226,16 @@ Modules loaded via `cfg["module"]` must export:
 - **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
  workload for this algorithm is per-cube DP.
- **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
+- **Square-grid fallback requires `n_sips = k²`**: rectangular SIP grids
-  `mesh_2d_no_wrap` require `n_sips = k²`.
+  (non-square mesh/torus) are supported, but only when `system.sips.w/h`
  are given explicitly (ADR-0024 D5). With `w/h` omitted, 2D topologies
  fall back to a square grid and still require `n_sips = k²`.
 - **Pipelined chunks**: single-tile per cube, no pipelining yet.
 - **Root cube runtime election**: the kernel currently uses
-  `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
+  `root_cube = (mesh_h // 2) * mesh_w + (mesh_w // 2)` — the geometric
-  corner. SFR wiring covers all cubes, so runtime election is a pure kernel
+  center, chosen to minimize the intra-SIP critical path. SFR wiring
-  change when needed.
+  covers all cubes, so electing a different root at runtime is a pure
  kernel change when needed.
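As a rough illustration of why the center is preferred, the sketch below compares the two root formulas on a 4x4 cube mesh. The mesh size, the row-major cube indexing, and the use of worst-case Manhattan hop count as a proxy for the intra-SIP critical path are all assumptions made for this example.

```python
def center_root_cube(mesh_w: int, mesh_h: int) -> int:
    """Geometric-center root, using the formula quoted in the text."""
    return (mesh_h // 2) * mesh_w + (mesh_w // 2)


def se_corner_root_cube(mesh_w: int, mesh_h: int) -> int:
    """Previous SE-corner root, kept here only for comparison."""
    return (mesh_h - 1) * mesh_w + (mesh_w - 1)


def max_hops_to(root: int, mesh_w: int, mesh_h: int) -> int:
    """Worst-case Manhattan distance from any cube to `root` (no wrap links)."""
    rx, ry = root % mesh_w, root // mesh_w
    return max(abs(x - rx) + abs(y - ry)
               for y in range(mesh_h) for x in range(mesh_w))


for name, fn in (("center", center_root_cube), ("se_corner", se_corner_root_cube)):
    root = fn(4, 4)
    print(name, "root_cube =", root, "max hops =", max_hops_to(root, 4, 4))
```

On this assumed 4x4 mesh the SE-corner formula yields cube 15, the same value as the `root_cube: 15` placeholder in `ccl.yaml`, while the center formula yields cube 10 with a shorter worst-case distance.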
 ---
@@ -242,15 +268,14 @@ Modules loaded via `cfg["module"]` must export:
 | File | Change |
 |---|---|
-| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
+| `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
 | `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
 | `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
 | `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
 | `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
-| `ccl.yaml` | Single `intercube_allreduce` entry |
+| `ccl.yaml` | Single `lrab_hierarchical_allreduce` entry |
 | `topology.yaml` | Added `system.sips.topology` |
 | `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
-| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
+| `tests/sccl/` (test package) | Config-driven ring/torus/mesh correctness + full `dist.all_reduce` path + latency/buffer-kind sweeps (evaluation harness — ADR-0043) |
-| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
+| `tests/test_intercube_sfr_config.py` | SFR wiring verification |
 | `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
 | Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
@@ -0,0 +1,139 @@
 # ADR-0038: PCIE_EP Component Model
 ## Status
 Accepted (2026-05-20).
 Companion to ADR-0035 (M_CPU), ADR-0036 (IO_CPU), and
 ADR-0037 (Forwarding) at the same component-model level.
 ## First action
 Pull one Transaction from `_inbox` and let `_forward_txn` invoke `run()`, which
 applies a single `env.timeout(node.attrs["overhead_ns"])` for PCIe protocol
 handling. After that the standard `ComponentBase` worker rules take over: if
 `next_hop` exists, put the advanced Transaction on `out_ports[next_hop]`;
 otherwise consume `drain_ns` and call `txn.done.succeed()`.
 In other words, **PCIE_EP's first (and only) act is to spend the configured
 overhead as simulator time** — no routing decisions, no payload transformation,
 no MMIO decoding.
 ## Context
 PCIE_EP is the **host ↔ device boundary** in the topology graph. The builder
 (`topology/builder.py`) creates an IO chiplet instance per SIP that contains
 `pcie_ep`, `io_cpu`, and `io_noc`, and lays bidirectional edges between the
 external `fabric.switch0` and each `pcie_ep`:
 - `switch → pcie_ep`: host → device traffic (MemoryWrite, MemoryRead,
  KernelLaunch).
 - `pcie_ep → switch`: device-side outbound (e.g., cross-SIP IPCQ tokens).
 Inside the IO chiplet there are bidirectional `pcie_ep ↔ io_noc` edges, and
 from there traffic branches to `io_cpu` or to the cube-side `hbm_ctrl` path
 (see ADR-0036 IO_CPU model). The router and resolver already know — per SPEC
 R7 — that PCIE_EP is the endpoint for memory operations, so helpers like
 `find_pcie_ep(sip)` and `find_memory_path(pcie_ep, dst_node)` treat PCIE_EP as
 the start (or end) of the memory path.
 The problem is that all of this dependency lives in builder/router/resolver,
 while **PCIE_EP's own internal model has no ADR**. The consequence:
 - "What latency does PCIE_EP model?" requires reading the source.
 - The asymmetry with peer components (IO_CPU = ADR-0036, M_CPU = ADR-0035) is
  awkward.
 - Future decisions about a more detailed PCIe link-layer model (TLP credits,
  retry, MPS chunking) lack a documented baseline.
 This ADR pins down the current **thin PCIE_EP model** and records that this
 thinness is intentional (aligned with ADR-0033's latency-model simplification
 policy).
 ## Decision
 ### D1. PCIE_EP uses ComponentBase's generic forwarding worker as-is
 `PcieEpComponent` extends `ComponentBase` and does **not** override `_worker` or
 `_forward_txn`. Every Transaction flows through the standard sequence:
 1. `_fan_in` accumulates inbound messages (and reassembles Flits, per ADR-0033
   Phase 2c) into `_inbox`.
 2. `_worker` pulls one message off `_inbox` and spawns
   `env.process(self._forward_txn(env, txn))` for per-message pipelining.
 3. `_forward_txn` calls the op_log start hook → `run()` for latency → op_log
   end hook.
 4. `run()` is a single line: `yield env.timeout(overhead_ns)`.
 5. If a next hop exists, `out_ports[next_hop].put(txn.advance())`. Otherwise
   (terminal arrival) consume `drain_ns` and call `txn.done.succeed()`.
 ### D2. The only timing parameter is `overhead_ns`
 Only `node.attrs["overhead_ns"]` is accepted as a latency parameter. The code
 default is `0.0`; `topology.yaml`'s IOChiplet `components.pcie_ep.attrs`
 supplies the real value (current topology: `overhead_ns: 5.0` ns).
 No separate BW-serialization resource (`simpy.Resource`), no queue depth, no
 retry model is introduced. Link-level BW serialization is handled wire-side —
 inside the IOChiplet by `pcie_ep_to_noc_bw_gbs = 256.0 GB/s`, and externally by
 the system's `io_ep_to_switch` link BW (ADR-0015 port/wire model). PCIE_EP
 itself takes no part in that accounting.
 ### D3. PCIE_EP is direction-aware in topology but direction-blind in code
 The builder lays both `switch ↔ pcie_ep` and `pcie_ep ↔ io_noc` edges, so
 PCIE_EP serves:
 - inbound (host → device): forward Transactions arriving from the switch onto
  io_noc-side next-hop.
 - outbound (device → host): forward Transactions arriving from io_noc/io_cpu
  back to the switch.
 Both are handled by D1's generic forwarding worker; the component code never
 distinguishes direction (it just follows `txn.next_hop`).
 ### D4. PCIE_EP is not Flit-aware (legacy reassembly path)
 `_FLIT_AWARE` is left at the inherited `False`, so `_fan_in` reassembles
 upstream-chunkified Flits into the parent Transaction before delivery to
 `_inbox` (aligned with ADR-0033 Phase 2c incremental rollout).
 A future PCIe TLP-level credit model would revisit D4.
 ### D5. PCIE_EP is a **named node** for routing helpers
 `policy/routing/router.py` provides `find_pcie_ep(sip, io_id="io0")`,
 `find_all_pcie_eps()`, and `find_memory_path(pcie_ep, dst_node)` — all of
 which treat PCIE_EP as the start (or end) of the memory path. The component
 itself supplies no information to these helpers; the naming convention
 (`sip{S}.{io_id}.pcie_ep`) is guaranteed by the topology builder.
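The naming convention can be made concrete with a small sketch. `pcie_ep_node_id` and `parse_pcie_ep_node_id` are hypothetical helpers written for this example; they are not the router's `find_pcie_ep`, whose return type is not described here.

```python
import re
from typing import Tuple


def pcie_ep_node_id(sip: int, io_id: str = "io0") -> str:
    """Build a node ID following the `sip{S}.{io_id}.pcie_ep` convention."""
    return f"sip{sip}.{io_id}.pcie_ep"


def parse_pcie_ep_node_id(node_id: str) -> Tuple[int, str]:
    """Inverse of the above; raises if the ID does not follow the convention."""
    m = re.fullmatch(r"sip(\d+)\.([^.]+)\.pcie_ep", node_id)
    if m is None:
        raise ValueError(f"not a PCIE_EP node ID: {node_id!r}")
    return int(m.group(1)), m.group(2)


print(pcie_ep_node_id(0))
print(parse_pcie_ep_node_id("sip2.io0.pcie_ep"))
```

Because the helpers depend only on the string pattern, a rename of the component ID would break exactly this kind of code, which is the blast radius D5 is meant to bound.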
 ## Alternatives Considered
 ### A1. Full PCIe TLP-level model (credits, retry, MPS chunking)
 Rejected. Violates ADR-0033's "current latency model = abstract overhead + BW
 serialization" simplification. Host↔device protocol fidelity is explicitly
 out-of-scope in SPEC §5 "Non-Goals".
 ### A2. Per-PCIE_EP `simpy.Resource` for in-flight cap
 Rejected. Host traffic is not a contention bottleneck in current workloads.
 Defer to a separate ADR if it becomes one (in which case D1 stays and D2 is
 extended).
 ### A3. Merge PCIE_EP into IO_CPU
 Rejected. PCIE_EP is the protocol-boundary node first hit on the host side;
 IO_CPU is the device-side control-plane processing node (ADR-0036). Traffic
 fan-out and command decoding costs concentrate in IO_CPU, while PCIE_EP only
 expresses link-edge overhead. Merging them would mix two responsibilities and
 violate the spirit of ADR-0007 (runtime API/sim_engine boundaries).
 ## Consequences
 - PCIE_EP gets an explicit model ADR despite having near-zero code — consistent
  with peer component ADRs, lower maintenance friction.
 - A future PCIe-level refinement would supersede this ADR by extending D2/D4 in a new ADR.
 - D5 makes the named-node dependency explicit, so any future renaming of
  component IDs has a clearly bounded blast radius.
@@ -0,0 +1,203 @@
 # ADR-0039: PE_MMU Component Model — Component + Utility Dual Role
 ## Status
 Accepted (2026-05-20).
 ADR-0011 (PA/VA/LA address model) only states that "the VA model translates
 VA→PA via PE_MMU"; this ADR pins down **the PE_MMU component's own behavior
 model**.
 ## First action
 At construction, read `node.attrs["page_size"]` (default `2 MiB`) and
 `node.attrs["tlb_overhead_ns"]` (default `0.0`) and instantiate the internal
 `PeMMU` utility object (`policy.address.pe_mmu.PeMMU`) exactly once. That
 object is the single owner of the page table, the sub-page region lists, and
 the TLB overhead value.
 At runtime the first action splits into two paths:
 - **Component path (inbox consumption)**: `_worker` pulls a Transaction off
  `_inbox`; if `request` is a `MmuMapMsg`, call `self._mmu.map(va, pa, size)`
  for each entry and then `txn.done.succeed()`. For `MmuUnmapMsg`, call
  `unmap(va, size)`. Any other type falls through to standard `_forward_txn`.
  In other words, **the component's first act is "apply map/unmap commands to
  the page table"**.
 - **Utility path (direct call)**: a sibling PE engine (PE_DMA / PE_GEMM) calls
  `pe_mmu.mmu.translate(va)` directly. This path produces no SimPy events;
  the caller (when `overhead_ns > 0`) issues a `yield env.timeout(mmu.overhead_ns)`
  in its own process.
 ## Context
 ADR-0011 defined three address models (PA/VA/LA) and agreed that "VA model =
 translation via PE_MMU". But in code, `PeMmuComponent` performs two
 complementary roles simultaneously:
 1. **A topology-graph component**: it receives `MmuMapMsg` / `MmuUnmapMsg`
   sideband messages over the cube NoC and updates the page table.
 2. **A PE-local utility**: PE_DMA / PE_GEMM on the same PE call
   `translate(va)` directly with zero SimPy latency (the caller pays
   `overhead_ns` if any).
 Without an ADR covering both roles, the following questions are ambiguous:
 - "Why isn't there a SimPy event for the MMU translate?" (Answer: the caller
  pays it.)
 - What is the sub-page region model, and why? (The code docstring has it, but
  no ADR — only a memory note `project_mmu_subpage_stopgap`.)
 - Who sends map/unmap, and when must they be visible? (Ordering contract.)
 Additionally, `PeMMU.map()` has "append, last-write-wins on overlap"
 semantics, which a one-PA-per-entry page table cannot express.
 This is a deliberate **simulator stopgap** to support DPPolicy sub-page sharding
 (e.g., 128 B payloads against 4 KiB pages): with one PA per page, a later shard's
 mapping would silently replace an earlier shard's mapping for the whole page and
 misroute its traffic. This deviation from real HW MMU semantics must be ADR-pinned.
 ## Decision
 ### D1. Explicit dual role — component and utility
 `PeMmuComponent` exposes two interfaces from a single class:
 - Component interface: `_inbox` consumption, `_worker` loop (handles MMU
  sideband messages).
 - Utility interface: the `mmu` property exposes the underlying `PeMMU` object,
  which PE_DMA / PE_GEMM hold directly and invoke `translate()` on.
 The latter is **not a layer skip**: inside a PE, the engines and PE_MMU are
 siblings under the "components" layer (ADR-0007). Cross-layer violations only
 apply to runtime API ↔ sim_engine ↔ components boundaries.
 ### D2. Latency model — `translate()` is pure; caller owns the timeout
 `PeMMU.translate()` is a pure function and yields nothing in SimPy. The caller
 (a PE engine) issues `if mmu.overhead_ns > 0: yield env.timeout(mmu.overhead_ns)`
 in its own process after translation.
 Rationale: the PE engine process already holds its own `record_start` /
 `record_end` (op_log) hooks, so keeping timing inside the caller's process
 preserves consistent timing accounting. A separate MMU process would split the
 engine's processing flow and blur op_log / pipeline overlap semantics.
 #### D2.1. Current implementation asymmetry — pipeline vs non-pipeline (known)
 At the time of writing, `pe_dma.py` handles MMU overhead differently in its
 two call paths:
 - **non-pipeline (`handle_command`)**: after `translate()`, applies
  `if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`.
 - **pipeline (`_do_pipeline_dma`)**: calls `translate()` only, **omitting**
  the overhead timeout — though the comment says "same logic as non-pipeline
  path", the behaviors differ.
 In the default topology, `tlb_overhead_ns = 0.0`, so this asymmetry does not
 manifest. With `tlb_overhead_ns > 0`, however, GEMM/Math work on the pipeline
 path skips the MMU overhead and therefore appears faster than the equivalent
 non-pipeline workload.
 The D2 contract states that **all** callers pay the overhead; the pipeline
 omission is **not an intentional design** — ADR-0014 D6 (pipeline self-routing)
 does not exempt it. Remediation options (require a separate Phase 1/2):
 - (a) Add `if mmu.overhead_ns > 0: yield env.timeout(...)` in
  `_do_pipeline_dma` to align with D2 — **preferred**.
 - (b) Narrow the D2 contract to "non-pipeline only" and document the pipeline
  exemption in an ADR-0014 update — discouraged, since it weakens the
  overhead's meaning.
 This ADR recommends (a) and assumes a small follow-up change either before or
 just after acceptance.
 ### D3. Page table structure — sub-page region list (stopgap)
 `self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]`
 holds multiple disjoint regions per page.
 - `map(va, pa, size)`: append a region for the range, splitting it into
  per-page regions when the range crosses a page boundary.
 - `translate(va)`: look up regions for the VPN and iterate **in reverse** so
  the most recent overlapping region wins (last-write-wins).
 - `unmap(va, size)`: remove only regions whose extent is **fully contained**
  within the unmap range; partial-overlap boundaries are left in place and the
  caller is expected to unmap on the same boundaries used for map.
 This is documented as a **simulator stopgap** that supplements the VA model
 from ADR-0011. It prevents silent last-write-wins misrouting when DPPolicy
 shards below page granularity. Memory note: `project_mmu_subpage_stopgap`.
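A minimal sketch of the region-list table described in D3 is shown below. `SubPageTable` is a hypothetical stand-in for `PeMMU`, not its source; in particular, reading `pa_at_offset_zero` as "the PA that in-page offset 0 would translate to" is an assumption inferred from the field name.

```python
from collections import defaultdict


class PageFault(Exception):
    """Raised when a VA has no mapping."""


class SubPageTable:
    """Hypothetical stand-in for the region-list page table in D3."""

    def __init__(self, page_size: int = 2 * 1024 * 1024) -> None:
        self.page_size = page_size
        # vpn -> list of (start_in_page, end_in_page, pa_at_offset_zero)
        self._table = defaultdict(list)

    def _split(self, va: int, size: int):
        """Yield (vpn, start_in_page, end_in_page, bytes_before) per page."""
        done = 0
        while done < size:
            vpn, start = divmod(va + done, self.page_size)
            end = min(self.page_size, start + (size - done))
            yield vpn, start, end, done
            done += end - start

    def map(self, va: int, pa: int, size: int) -> None:
        for vpn, start, end, done in self._split(va, size):
            # Store the PA that offset 0 of this page would map to.
            self._table[vpn].append((start, end, pa + done - start))

    def translate(self, va: int) -> int:
        vpn, off = divmod(va, self.page_size)
        # Reverse iteration: the most recently appended region wins.
        for start, end, pa0 in reversed(self._table.get(vpn, [])):
            if start <= off < end:
                return pa0 + off
        raise PageFault(hex(va))

    def unmap(self, va: int, size: int) -> None:
        for vpn, start, end, _ in self._split(va, size):
            # Drop only regions fully contained in the unmap range.
            self._table[vpn] = [r for r in self._table[vpn]
                                if not (start <= r[0] and r[1] <= end)]


table = SubPageTable(page_size=4096)
table.map(va=0x1000, pa=0x8000, size=128)    # shard A
table.map(va=0x1080, pa=0x20000, size=128)   # shard B, same page
print(hex(table.translate(0x1004)), hex(table.translate(0x1084)))
```

Both shards share one 4 KiB page yet translate to different PAs, which is the case a one-PA-per-page table cannot represent.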
 ### D4. PageFault signals PA fallback
 If `translate()` is called with an unmapped VA, `PageFault` is raised. PE_DMA
 catches the exception and **uses the original address as a PA** (the PA-only
 backward-compatibility path from ADR-0011). PageFault is therefore not an
 error — it is the signal for "no VA mapping, interpret as PA".
 This path is intentional and preserves backward compatibility with the
 ADR-0011 PA-only mode.
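The fallback in D4 amounts to a small try/except on the caller side. The sketch below uses a dict-backed stand-in for `translate()`; `resolve_to_pa` and `make_translate` are hypothetical names, not functions from `pe_dma.py`.

```python
class PageFault(Exception):
    """Signals "no VA mapping" (not an error in this model)."""


def make_translate(mappings):
    """Build a toy translate(va) from an exact-match {va: pa} dict."""
    def translate(va: int) -> int:
        try:
            return mappings[va]
        except KeyError:
            raise PageFault(hex(va)) from None
    return translate


def resolve_to_pa(translate, addr: int) -> int:
    """Return the translated PA, or `addr` itself when no mapping exists."""
    try:
        return translate(addr)
    except PageFault:
        # PA-only mode: the address was a PA all along.
        return addr


translate = make_translate({0x1000: 0x8000})
print(hex(resolve_to_pa(translate, 0x1000)))  # mapped VA
print(hex(resolve_to_pa(translate, 0x4000)))  # unmapped: treated as a PA
```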
 ### D5. MMU sideband-message reception contract
 `MmuMapMsg` / `MmuUnmapMsg` arrive over the fabric at PE_MMU's `_inbox`
 (SPEC R10: "MMU map installation incurs measured fabric latency"). Schemas
 live in `runtime_api/kernel.py`:
 - `MmuMapMsg.entries: tuple[dict, ...]` — each dict is
  `{"va": int, "pa": int, "size": int}`.
 - `MmuUnmapMsg.entries: tuple[dict, ...]` — each dict is
  `{"va": int, "size": int}`.
 PE_MMU reception flow:
 1. `_worker` does `_inbox.get()` for one message.
 2. `hasattr(msg, "request")` confirms a Transaction wrapper.
 3. `isinstance(msg.request, MmuMapMsg)` → for each entry, call
   `self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"])`.
 4. `isinstance(msg.request, MmuUnmapMsg)` → for each entry, call
   `self._mmu.unmap(va=e["va"], size=e["size"])`.
 5. Both signal `msg.done.succeed()` after completion.
 An external caller (runtime API) waiting on `done` therefore receives a SimPy
 guarantee that "the mapping is installed on-device". This is the realization
 of ADR-0011's "MMU map installation incurs measured fabric latency".
 This ADR does **not** define the **sender or fan-out policy** for the sideband
 message — those are runtime API responsibilities. Only the receive contract
 belongs here.
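The reception flow in D5 can be sketched without SimPy as a plain dispatch function. The message classes below mirror the documented `entries` schema, but `FakeDone`, `FakeTxn`, `RecordingMmu`, and `handle_inbox_message` are hypothetical stand-ins written for this example; returning `False` to mean "hand off to generic forwarding" is also an assumption about how D6 would be wired in.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class MmuMapMsg:
    entries: Tuple[dict, ...]  # each: {"va": int, "pa": int, "size": int}


@dataclass(frozen=True)
class MmuUnmapMsg:
    entries: Tuple[dict, ...]  # each: {"va": int, "size": int}


class FakeDone:
    """Stand-in for a SimPy event; only records that succeed() was called."""

    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        self.triggered = True


@dataclass
class FakeTxn:
    request: object
    done: FakeDone = field(default_factory=FakeDone)


class RecordingMmu:
    """Records map/unmap calls instead of maintaining a page table."""

    def __init__(self) -> None:
        self.calls = []

    def map(self, va, pa, size):
        self.calls.append(("map", va, pa, size))

    def unmap(self, va, size):
        self.calls.append(("unmap", va, size))


def handle_inbox_message(mmu, msg) -> bool:
    """Apply MMU sideband messages; return False for anything else."""
    if not hasattr(msg, "request"):
        return False
    req = msg.request
    if isinstance(req, MmuMapMsg):
        for e in req.entries:
            mmu.map(va=e["va"], pa=e["pa"], size=e["size"])
    elif isinstance(req, MmuUnmapMsg):
        for e in req.entries:
            mmu.unmap(va=e["va"], size=e["size"])
    else:
        return False
    msg.done.succeed()  # the mapping change is now visible
    return True


mmu = RecordingMmu()
txn = FakeTxn(MmuMapMsg(entries=({"va": 0x1000, "pa": 0x8000, "size": 128},)))
print(handle_inbox_message(mmu, txn), txn.done.triggered, mmu.calls)
```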
 ### D6. Non-MMU Transactions delegate to generic forwarding
 If a message pulled from `_inbox` is not `MmuMapMsg` / `MmuUnmapMsg` (or
 lacks a `request` attribute), `_forward_txn` handles it normally. This keeps
 the door open for future topologies where PE_MMU sits on a pass-through path —
 current code never sends such traffic, but the routing remains safe.
 ## Alternatives Considered
 ### A1. Make `translate()` a SimPy generator
 Rejected. As D2 explains, this blurs op_log / pipeline overlap accounting in
 the PE engine.
 ### A2. Use small page size (e.g., 128 B) instead of sub-page regions
 Rejected. Would explode page-table memory and cube-wide map message size. Most
 mappings are 2 MiB; pushing the page size below that for the few DPPolicy
 sharding cases inflates average cost.
 ### A3. Make PE_MMU a PE_CPU helper only (not a topology node)
 Rejected. ADR-0011 requires that MMU map installation incur measured fabric
 latency (via `MmuMapMsg`), which requires PE_MMU to be a node on the graph.
 It also keeps cube NoC visualizer output consistent.
 ## Consequences
 - PE_MMU's dual role is justified at ADR level, so future "unify into one"
  refactor pressure has a documented counterpoint.
 - The sub-page region model is explicitly labeled a stopgap, providing a
  basis for deprecating it when the LA model (ADR-0011) lands.
 - The "`translate()` does not yield" contract is locked in (D2), so any
  future proposal to add an internal MMU timeout can be denied with a
  documented rationale.
 - PA fallback (D4) is normalized, preventing defensive logic from treating
  PageFault as an error.
@@ -0,0 +1,149 @@
 # ADR-0040: PE_TCM Component Model — Dual-Channel BW Serialization
 ## Status
 Accepted (2026-05-20).
 ADR-0014 (PE Pipeline Execution Model, D1) references PE_TCM as a "BW-based
 serialized scratchpad memory" but does not pin down the component's own model.
 This ADR fills that gap.
 ## First action
 When `start()` is invoked, immediately create two `simpy.Resource(env, capacity=1)`
 instances and store them in `self._read_res` / `self._write_res`. These two
 resources are the single decision points that serialize the **read channel**
 and **write channel** to one in-flight request each.
 The runtime first action: `_worker` pulls a message off `_inbox` and branches
 by type:
 - `TcmRequest` (from `pe_fetch_store`): spawn `env.process(self._handle_tcm_request)`.
  Hence **TCM's first act is "acquire the lock matching the direction
  (read/write)"**. After lock acquisition, if `bw > 0 and nbytes > 0`, yield
  `env.timeout(delay_ns = nbytes / bw)`, then `req.done.succeed()`.
 - Anything else (Transaction): spawn `env.process(self._forward_txn)` (legacy
  fabric pass-through).
 At construction, `node.attrs["read_bw_gbs"]` and `node.attrs["write_bw_gbs"]`
 (default `512.0 GB/s` each) are captured and held.
 ## Context
 In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:
 1. **`TcmRequest` from PE_FETCH_STORE** — when moving data between TCM and
   the register file, PE_FETCH_STORE sends a short sideband request to obtain
   BW-serialized access latency (`direction = "read"` or `"write"`, `nbytes`,
   `done` event).
 2. **Legacy Transaction forwarding** — a fallback in case TCM ends up as a
   pass-through node on the fabric graph (not used by the current critical
   path, but preserved).
 The problem: ADR-0014 only says "BW-based serialization" without specifying:
 - Read and write are **independent channels** running in parallel; only
  same-direction concurrency serializes at `capacity=1`.
 - BW is split into two configurable values (`read_bw_gbs` / `write_bw_gbs`).
 - The formula is `delay_ns = nbytes / bw_gbs` (loose unit convention:
  GB/s × ns ≈ B).
 - `nbytes == 0` still acquires the lock but skips the BW term.
 - `run()`'s `overhead_ns` (default `0.0`) is only used in the legacy fabric
  forwarding path.
 Each of these requires an ADR. In particular, "why are read and write
 separate channels" and "who owns the BW values" must be documented so that
 future changes (e.g., `capacity=2`) have a clear basis.
 ## Decision
 ### D1. Dual channel — read and write are independent resources
 `_read_res = simpy.Resource(env, capacity=1)`,
 `_write_res = simpy.Resource(env, capacity=1)`.
 Same-direction concurrent requests queue on the resource and serialize;
 opposite-direction requests proceed in parallel. This matches the hardware
 model where TCM has a dual-port (read + write) configuration, and it allows
 the simulator to express the GEMM-pipeline case where fetch (read) and store
 (write) overlap in time — modeled as BW-serialized inside each direction but
 independent across directions.
 ### D2. Per-channel BW model — `nbytes / bw_gbs`
 After lock acquisition, if `nbytes > 0 and bw > 0`, yield
 `env.timeout(nbytes / bw_gbs)`. The unit convention is GB/s × ns ≈ B,
 consistent with the simulator-wide loose convention (see ADR-0033).
 - `nbytes == 0`: BW term is zero, but the lock is acquired and released. This
  is intentional: when a plan generator emits an empty fetch/store on the
  PE_FETCH_STORE side, the op_log / channel accounting on the TCM side still
  records one consumption.
 - `bw == 0` (config error): the timeout call is skipped (0-time pass). Should
  not occur with normal settings.
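The combined effect of D1 and D2 can be illustrated with a small SimPy-free timeline calculation. This is a sketch under stated assumptions: `tcm_delay_ns` and `schedule` are hypothetical names, requests are assumed to be granted in arrival order within each channel, and 1 GB/s is taken as exactly 1 byte per ns.

```python
from typing import List, Tuple


def tcm_delay_ns(nbytes: int, bw_gbs: float) -> float:
    """BW term from D2; zero when nbytes or bw is not positive."""
    if nbytes > 0 and bw_gbs > 0:
        return nbytes / bw_gbs
    return 0.0


def schedule(requests: List[Tuple[float, str, int]],
             read_bw_gbs: float = 512.0,
             write_bw_gbs: float = 512.0) -> List[float]:
    """Return the finish time (ns) of each (arrival_ns, direction, nbytes).

    Each direction has its own capacity-1 channel, so same-direction
    requests serialize while opposite directions overlap.
    """
    free_at = {"read": 0.0, "write": 0.0}
    bw = {"read": read_bw_gbs, "write": write_bw_gbs}
    finish = []
    for arrival_ns, direction, nbytes in requests:
        start = max(arrival_ns, free_at[direction])  # wait for the channel
        end = start + tcm_delay_ns(nbytes, bw[direction])
        free_at[direction] = end
        finish.append(end)
    return finish


reqs = [(0.0, "read", 1024), (0.0, "read", 1024), (0.0, "write", 512)]
print(schedule(reqs))
```

The two reads finish back to back, while the write completes independently of both, which is the fetch/store overlap D1 is designed to express.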
 ### D3. BW values come from `node.attrs.read_bw_gbs` / `write_bw_gbs`
 Defaults `512.0 GB/s`. The topology builder (`topology/builder.py`) passes
 these attrs when instantiating TCM from `pe_template`. Default changes should
 coincide with related decisions in ADR-0014 D1 or ADR-0033.
 ### D4. TcmRequest schema is owned by PE_TCM
 `@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
 lives in `components/builtin/pe_tcm.py`. PE_FETCH_STORE imports the dataclass
 and only constructs/sends it. The caller does not define the schema because:
 - The meaning of BW serialization is TCM's responsibility — TCM decides which
  fields drive serialization.
 - The valid-value check for `direction` (must be `"read"` or `"write"`) lives
  in `_handle_tcm_request`'s if/else branch.
 ### D5. Legacy Transaction forwarding path is preserved
 When `_worker` receives a non-`TcmRequest` message, it dispatches to
 `_forward_txn`, applying `run()`'s `overhead_ns`. The current standard PE
 pipeline does not route Transactions through TCM, but the path is kept to
 avoid breakage if fabric topology changes.
 This path is accounted for via standard Transaction op_log; the BW channel
 locks are **not** acquired (orthogonal to D1's usage).
 ### D6. PE_TCM is not a data store (timing only)
 TCM models **time only**. The actual data payload is held by sim_engine's
 `memory_store` (when present); the TCM component never updates it.
 PE_FETCH_STORE obtains BW delay through `TcmRequest`, and register contents
 are handled separately in the data path (ADR-0020 2-pass data execution —
 Phase 2).
 ## Alternatives Considered
 ### A1. Single channel (`capacity=2` for shared read+write)
 Rejected. Would artificially serialize the normal-case overlap of fetch
 (read) and store (write) and yield an incorrect BW upper bound for the PE
 pipeline.
 ### A2. `capacity > 1` (e.g., 2-banked TCM)
 Rejected. The current hardware model assumes a single bank. A multi-bank extension
 needs its own ADR that would supersede D1. Bumping capacity now would loosen
 the nominal serialization without raising the BW upper bound, producing less
 accurate modeling.
 ### A3. Generalize BW formula to `nbytes / bw + overhead_ns`
 Rejected. `overhead_ns` is reserved for the legacy forwarding path (D5).
 Additional fetch/store-path overhead, if needed, belongs in PE_FETCH_STORE's
 `run()` or in a register-file access model — closer to the responsibility
 boundary.
 ## Consequences
 - TCM's BW accounting is locked at ADR level. Questions arising from op_log
  in GEMM/Math sweeps — "why did fetch and store overlap?", "why do only
  same-direction requests serialize?" — resolve quickly to D1.
 - Future multi-bank TCM models or asymmetric read/write BW changes have a
  clear blast radius (D1 / D2 / D3 — pick one).
 - D6 ("TCM is not a data store") sharpens the responsibility boundary with
  ADR-0020 2-pass execution.
@@ -0,0 +1,195 @@
 # ADR-0041: Cube SRAM Component Model — terminal scratchpad on cube NoC
 ## Status
 Accepted (2026-05-20).
 ADR-0017 (Cube NOC and HBM Connectivity) describes SRAM as a cube-NoC
 attachment but does not specify the SRAM component's own latency / response
 model. This ADR fills that gap.
 ## First action
 Inside `_worker`, immediately after pulling a Transaction off `_inbox`, the
 very first action is `yield from self.run(env, txn.nbytes)`. Inside `run()`,
 the component applies `env.timeout(node.attrs["overhead_ns"])`
 (default `0.0`).
 In short, **SRAM's first act is "express access overhead as simulator time"**.
 After overhead, the worker yields `drain_ns` (the terminal BW-serialization
 cost stamped on the Transaction) and then constructs and dispatches a
 `ResponseMsg` on the reverse path.
 This differs from a generic `ComponentBase._worker`: SRAM knows it is a
 **terminal node**, so it does not go through `_forward_txn`. Its own worker
 explicitly performs `run → drain → _send_response`.
 ## Context
 The cube topology (`topology/builder.py`) creates the following named nodes
 per cube:
 - `sip{S}.cube{C}.m_cpu`
 - `sip{S}.cube{C}.sram`
 - `sip{S}.cube{C}.hbm_ctrl` (per-PE partitions)
 - `sip{S}.cube{C}.pe{P}` (and its PE-internal sub-components)
 SRAM is one of the cube-NoC attachments — `topology/mesh_gen.py` assigns it
 to the nearest router by placement coordinates and adds `"sram"` to that
 router's `attach` list. The builder lays bidirectional `sram ↔ router` edges
 (BW: `sram_to_router_bw_gbs`, default `128.0 GB/s`).
 SRAM has two intertwined roles:
 1. **Fabric terminal**: the endpoint for cube-NoC memory-access Transactions
   destined for SRAM. SRAM consumes access overhead + drain, then sends a
   response back on the reverse path.
 2. **One of the IPCQ slot tiers**: ADR-0023 D9.7 defines
   `buffer_kind ∈ {tcm, sram, hbm}`; the `sram` tier's per-access cost is
   `(512.0 GB/s, 2.0 ns)` in `common/ipcq_types._BUFFER_KIND_BW`. This is
   separate from the SRAM node's `overhead_ns` attr; PE_DMA accounts for it
   directly at the IPCQ slot-write moment.
 Without an ADR covering both roles, the following questions are ambiguous:
 - "What latency does SRAM model?" Fabric drain + overhead, or the IPCQ
  tier slot latency? The answer is scattered across the code.
 - What does the `size_mb` (`32`) attr mean? It currently has no effect on
  timing; SRAM only models timing.
 - Which cube router does SRAM attach to? (placement-based; lives in topology
  code only.)
 ## Decision
 ### D1. SRAM is a terminal scratchpad node on the cube NoC
 `SramComponent` extends `ComponentBase` but overrides `_worker` to express
 terminal semantics directly:
 ```
 while True:
    txn = yield self._inbox.get()
    yield from self.run(env, txn.nbytes)     # overhead_ns
    if drain_ns > 0: yield env.timeout(drain_ns)
    yield from self._send_response(env, txn)
 ```
 This pattern is necessary because SRAM must know the reverse path; the
 generic `_forward_txn` (which forwards to the next hop) does not fit a
 terminal.
 #### D1.1. Currently dormant — the `_worker` override is an unused path
 At the time of writing, **no component actually sends a Transaction to the
 SRAM node**. The verified references to the SRAM node ID are:
 - `policy/routing/router.py` and friends — guarantee path lookups.
 - `components/builtin/pe_dma.py::_handle_ipcq_inbound` — for
  `buffer_kind == "sram"`, computes the *path* to
  `bank_node = f"{cube_prefix}.sram"` via `compute_drain_ns(path, ...)` and
  yields a **local** timeout. The Transaction itself does not flow to the
  SRAM node (see D4).
 - `tests/test_routing.py` — checks connectivity via
  `find_path("sip0.cube0.pe0", "sip0.cube0.sram")`.
 So the `_worker` / `_send_response` override is currently a **dormant code
 path**. It is preserved deliberately:
 - Topology changes that route fabric Transactions to SRAM terminally (e.g.,
  explicit M_CPU → SRAM accesses) would activate it immediately.
 - ADR-0017's "cube-attached scratchpad" semantics naturally implies terminal
  behavior; the override is an intentional placeholder.
 A future ADR (or a revision to this one) will mark dormancy resolved when an
 actual sender is added.
 ### D2. ResponseMsg construction and reverse-path dispatch
 `_send_response`:
 1. `reverse_path = list(reversed(txn.path))` — derive the reverse path.
 2. Construct `ResponseMsg(correlation_id=txn.request.correlation_id,
   request_id=..., src_cube=<this cube>, src_pe=-1, success=True)`.
 3. Wrap in `Transaction(request=resp_msg, path=reverse_path, step=0,
   nbytes=0, done=env.event(), is_response=True)` and put on
   `out_ports[reverse_path[1]]`.
 4. If the reverse path is too short (`< 2 hops`) or `ctx` is absent, fall
   back to calling the original `txn.done.succeed()`.
 `src_pe = -1` means "SRAM is not PE-localized". `src_cube` is parsed from the
 node ID (`sip{S}.cube{C}.sram`).
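The response construction in D2 can be sketched as a pure planning function. `ResponseSketch`, `parse_cube_index`, and `plan_response` are hypothetical names, the router node ID in the example path is made up, and interpreting "`< 2 hops`" as "fewer than two nodes in the reversed path" is an assumption.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ResponseSketch:
    """Reduced stand-in for ResponseMsg (only fields named in the text)."""
    correlation_id: int
    src_cube: int
    src_pe: int
    success: bool


def parse_cube_index(node_id: str) -> int:
    """Extract C from a `sip{S}.cube{C}.sram` node ID."""
    m = re.fullmatch(r"sip(\d+)\.cube(\d+)\.sram", node_id)
    if m is None:
        raise ValueError(f"not an SRAM node ID: {node_id!r}")
    return int(m.group(2))


def plan_response(node_id: str, path: List[str], correlation_id: int
                  ) -> Optional[Tuple[ResponseSketch, List[str], str]]:
    """Return (response, reverse_path, first_out_port), or None for the
    fallback case where the caller would complete the original `done`."""
    reverse_path = list(reversed(path))
    if len(reverse_path) < 2:
        return None
    resp = ResponseSketch(correlation_id=correlation_id,
                          src_cube=parse_cube_index(node_id),
                          src_pe=-1,  # SRAM is not PE-localized
                          success=True)
    return resp, reverse_path, reverse_path[1]


path = ["sip0.cube3.pe0", "sip0.cube3.router1", "sip0.cube3.sram"]
print(plan_response("sip0.cube3.sram", path, correlation_id=7))
```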
 ### D3. Timing parameters: `overhead_ns` and wire-side `drain_ns`
 - **Component-side latency**: `node.attrs["overhead_ns"]`. Default topology
  uses `2.0 ns`.
 - **Link-side serialization**: `drain_ns` arrives stamped on the Transaction
  — the wire-side BW serialization result from ADR-0015. SRAM only yields it.
 - The `size_mb` (default `32 MiB`) attr is currently timing-neutral. If a
  capacity-aware model is added in the future, a separate ADR will give it
  meaning.
 ### D4. IPCQ slot accounting is not modeled by the SRAM component
 Per ADR-0023 D9.7, the IPCQ slot-write latency for the SRAM tier is incurred
 inside PE_DMA's `_handle_ipcq_inbound`, which calls
 `slot_io_latency_ns("sram", nbytes)` using `_BUFFER_KIND_BW["sram"]`. That is:
 - When SRAM receives a fabric Transaction, D1, D2, and D3 apply and the
  Transaction is processed normally.
 - When an IPCQ slot lives on SRAM, PE_DMA pays the slot-write time directly —
  independent of the SRAM component.
 This separation is intentional: IPCQ is a fast path (sub-cycle slot
 bookkeeping) whose slot accesses are not carried by fabric Transactions, so
 SRAM does not need to know about IPCQ.
 ### D5. SRAM's cube-NoC attachment is placement-driven
 `topology/mesh_gen.py` reads `placement.sram.pos_mm` (default `[1.5, 9.0]` in
 `topology.yaml`) and adds `"sram"` to the nearest router's `attach`. The
 builder (`topology/builder.py`'s attachment loop) then lays bidirectional
 edges between the `sram` node and that router.
 This decision lives outside the SRAM component (mesh_gen / builder); the
 component does not know which router it sits on. It only relies on
 `txn.path` / `reverse_path` to reach it via a router.
 ### D6. SRAM is not a data store (timing only)
 Same context as ADR-0040 D6: the SRAM component models time only; the data
 payload (if any) lives in sim_engine's `memory_store`.
 ## Alternatives Considered
 ### A1. Use `_forward_txn` and route responses via separate nodes (à la IO_CPU / HBM_CTRL)
 Rejected. SRAM is a terminal on the cube NoC; adding a response node would
 introduce meaningless hops and violate ADR-0017's simplification spirit.
 ### A2. Model BW serialization inside SRAM with its own resource
 Rejected. Wire-side BW serialization (`drain_ns`) already captures it. An
 internal `simpy.Resource` would double-count against ADR-0015 (port/wire
 model).
 ### A3. Handle IPCQ slot accounting in the SRAM component
 Rejected. As D4 makes explicit, IPCQ is a fast path whose slot accesses are
 not carried by fabric Transactions. If SRAM knew about IPCQ, the responsibility
 would split across two places and obscure reasoning.
 ### A4. Capacity-aware latency from `size_mb`
 Rejected for now. The capacity is currently a visualizer label; introducing
 a capacity-aware timing model requires a dedicated ADR.
 ## Consequences
 - SRAM's timing model is pinned at ADR level as
  `overhead_ns + drain_ns + ResponseMsg(reverse_path)`. Any proposal to push
  IPCQ slot latency into the SRAM component can be refused with D4.
 - D3 records that `size_mb` is timing-neutral today, so a future
  capacity-aware model has a narrow compatibility scope.
 - D5 documents the placement-driven attachment, so changes to the SRAM
  coordinate have a clearly bounded impact (`mesh_gen` only).
@@ -0,0 +1,199 @@
 # ADR-0042: Tile Plan Generators — GEMM/Math Pipeline Plan Builders
 ## Status
 Accepted (2026-05-20).
 This ADR pins down `tiling.py` as a **plan-generator
 module**, not a SimPy component.
 ADR-0014 (PE Pipeline Execution Model) D6 (tile plan / self-routing) does not
 specify the tile-plan generation algorithm itself; this ADR fills that gap.
 ## First action
 When `generate_gemm_plan(M, K, N, tile_m, tile_k, tile_n, ..., pe_prefix,
 a_pinned, b_pinned, epilogue_specs)` is called, the very first action is
 **computing tile counts and constructing the PE-component ID strings**:
 ```
 M_tiles = max(1, ceil(M / tile_m))
 K_tiles = max(1, ceil(K / tile_k))
 N_tiles = max(1, ceil(N / tile_n))
 dma_id   = f"{pe_prefix}.pe_dma"
 fetch_id = f"{pe_prefix}.pe_fetch_store"
 gemm_id  = f"{pe_prefix}.pe_gemm"
 math_id  = f"{pe_prefix}.pe_math"
 ```
 In short, **the plan generator's first act is "compute ceiling tile counts
 and assemble the four sub-component IDs for this PE once"**. No SimPy event
 or environment is touched — this module is a pure function.
 `generate_math_plan(M, N, tile_m, tile_n, ..., math_op, src_addr, dst_addr,
 pe_prefix)` likewise begins by computing `M_tiles`, `N_tiles` and assembling
 three component IDs (`dma_id`, `fetch_id`, `math_id`).
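The tile-count step can be checked by hand. The sketch below wraps the expressions above in two hypothetical helpers (`tile_counts`, `gemm_component_ids`); the shape and the `pe_prefix` string are illustrative values, not taken from the code base.

```python
from math import ceil

def tile_counts(M, K, N, tile_m, tile_k, tile_n):
    """Ceiling tile counts, clamped to at least one tile per dimension."""
    return (
        max(1, ceil(M / tile_m)),
        max(1, ceil(K / tile_k)),
        max(1, ceil(N / tile_n)),
    )

def gemm_component_ids(pe_prefix):
    """The four sub-component IDs a GEMM plan refers to."""
    return {
        "dma": f"{pe_prefix}.pe_dma",
        "fetch": f"{pe_prefix}.pe_fetch_store",
        "gemm": f"{pe_prefix}.pe_gemm",
        "math": f"{pe_prefix}.pe_math",
    }

# A ragged shape: 100 / 32 rounds up to 4, and 33 / 32 rounds up to 2.
counts = tile_counts(100, 64, 33, tile_m=32, tile_k=64, tile_n=32)
# A zero-sized dimension still yields one tile because of max(1, ...).
degenerate = tile_counts(0, 64, 32, tile_m=32, tile_k=64, tile_n=32)
ids = gemm_component_ids("cube0.pe0")  # hypothetical prefix
```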
 ## Context
 ADR-0014 D6 agreed that "PE_SCHEDULER, on receiving a CompositeCmd, generates
 a TilePlan and feeds self-routing tile tokens". But the **concrete plan
 generation algorithm** lives in `src/kernbench/components/builtin/tiling.py`,
 which:
 - Defines no component — it is a pair of **pure functions**
  (`generate_gemm_plan`, `generate_math_plan`).
 - Does not depend on the SimPy environment, queues, op_log, or hooks.
 - Returns a `PipelinePlan` (dataclass).
 The original G4 analysis incorrectly described `tiling.py` as a component;
 it is in fact a plan-builder helper consumed by PE_SCHEDULER. Pinning this
 down in its own ADR (paired with ADR-0014 D6) prevents:
 - Ambiguity over whether plan generation belongs to PE_SCHEDULER or a
  separate module.
 - Inconsistent rationale for stage sequences (e.g., FETCH/STORE position)
  between GEMM and Math plans.
 - Undocumented branching rationale for `a_pinned` / `b_pinned` /
  `epilogue_specs`.
 ## Decision
 ### D1. `tiling` is a pure plan-generator module, not a component
 `components/builtin/tiling.py` defines no `ComponentBase` subclass. It exports
 two module-level functions:
 - `generate_gemm_plan(...) -> PipelinePlan`
 - `generate_math_plan(...) -> PipelinePlan`
 There is no `tiling` node in the topology graph. It lives in `builtin/`
 because it is a direct helper for PE_SCHEDULER (ADR-0014 D6) and is
 conceptually a PE_SCHEDULER internal utility.
 ### D2. GEMM plan stage sequence — `M → N → K` order
 For each `(m, n, k)` tile (default — no operand pinning, no epilogue):
 ```
 [DMA_READ(A)] → [DMA_READ(B)] → FETCH → GEMM
                                ↑
                                ↓
 (last k tile only)              [MATH(output_tile)]* → STORE → DMA_WRITE
 ```
 A `k_tile` epilogue inserts a MATH stage immediately after GEMM on every
 K-tile; an `output_tile` epilogue inserts MATH stages once per `(m, n)`, after
 the final K-tile but before STORE/DMA_WRITE. The K-loop accumulator stays
 in the register file across K-tiles, so STORE/DMA_WRITE happens only when
 `last_k`.
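The sequencing rules in D2 can be sketched as a small generator. This is an illustration under stated assumptions, not the code in `tiling.py`: stages are plain strings rather than `Stage` objects, operand pinning is ignored, and the function name and arguments are hypothetical.

```python
def gemm_stage_names(m_tiles, n_tiles, k_tiles, k_tile_ops=(), output_tile_ops=()):
    """Yield ((m, n, k), [stage names]) in M -> N -> K order."""
    for m in range(m_tiles):
        for n in range(n_tiles):
            for k in range(k_tiles):
                last_k = k == k_tiles - 1
                stages = ["DMA_READ(A)", "DMA_READ(B)", "FETCH", "GEMM"]
                # k_tile epilogue ops run right after GEMM on every K-tile.
                stages += [f"MATH({op})" for op in k_tile_ops]
                if last_k:
                    # output_tile ops run once per (m, n), just before the store.
                    stages += [f"MATH({op})" for op in output_tile_ops]
                    stages += ["STORE", "DMA_WRITE"]
                yield (m, n, k), stages

plan = list(gemm_stage_names(1, 2, 2, output_tile_ops=("bias",)))
order = [tile for tile, _ in plan]
```

With two K-tiles per `(m, n)`, only every second entry of `plan` ends in STORE and DMA_WRITE, which is the accumulator-residency rule expressed as data.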
 ### D3. Operand pinning — `a_pinned` / `b_pinned`
 If a caller passes `a_pinned=True`, **the A DMA_READ is omitted from every
 (m, n, k) tile**. Semantically: the caller (e.g., `tl.composite`) has already
 staged all of A in TCM via a prior `tl.load`, and signals so to the plan
 generator.
 The branch is made at plan time (not at runtime). Therefore the stage record
 count in op_log changes deterministically with pinning, and sweep analyses
 (e.g., gemm_sweep's stage record count) see this decision directly.
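Since the branch is taken at plan time, the number of DMA_READ stage records follows directly from the tile counts. A sketch of that arithmetic, assuming one op_log record per planned stage and that `b_pinned` omits the B read symmetrically to `a_pinned` (this section only spells out the A case); `dma_read_records` is a hypothetical helper.

```python
def dma_read_records(m_tiles, n_tiles, k_tiles, a_pinned=False, b_pinned=False):
    """Planned DMA_READ stages across all (m, n, k) tiles."""
    reads_per_tile = (0 if a_pinned else 1) + (0 if b_pinned else 1)
    return m_tiles * n_tiles * k_tiles * reads_per_tile

# Illustrative tile counts: 16 x 16 x 8 = 2048 tiles.
default_reads = dma_read_records(16, 16, 8)
a_pinned_reads = dma_read_records(16, 16, 8, a_pinned=True)
both_pinned_reads = dma_read_records(16, 16, 8, a_pinned=True, b_pinned=True)
```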
 ### D4. Epilogue scope — `k_tile` vs `output_tile`
 `epilogue_specs` is an iterable of op-spec objects. Each op object is
 expected to have:
 - `op.kind: str` — math op name (e.g., `"dequant"`, `"bias"`, `"relu"`,
  `"scale"`). Placed into the stage's `params["op_kind"]`.
 - `op.scope: Scope` — `Scope.K_TILE` or `Scope.OUTPUT_TILE` (`Scope` enum
  in `kernbench.common.pe_commands`).
 - Op-specific extras (e.g., `bias`, `scale`, `factor`) — currently not used
  by the plan generator; consumed at runtime by PE_MATH.
 The plan generator partitions by `getattr(o, "scope", None)`:
 - `scope == Scope.K_TILE`: adds a MATH stage right after GEMM on every K-tile.
 - `scope == Scope.OUTPUT_TILE`: adds MATH stages just before STORE on the
  last K-tile per `(m, n)`.
 Ops with neither `scope` value (e.g., missing attribute) are **dropped
 silently** — `getattr(..., None) == Scope.X` is False for both. Picking a
 default (`output_tile`) is the **caller's responsibility** (e.g.,
 `tl.composite`), not the plan generator's. This aligns with ADR-0014's
 composite epilogue contract.
 `Scope` is imported lazily inside the function to avoid the circular path
 `pe_commands ← pe_types ← tiling`. This is intentional and not a refactor
 target — keeping `tiling` free of compile-time `pe_commands` dependencies
 preserves the module boundary (D1).
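The partitioning rule can be sketched as follows. `Scope` here is a local stand-in for the enum in `kernbench.common.pe_commands` (its member values are assumed), and `partition_epilogue` is a hypothetical helper name; the `getattr(o, "scope", None)` comparison is the part taken from this section.

```python
from enum import Enum
from types import SimpleNamespace

class Scope(Enum):
    """Stand-in for kernbench.common.pe_commands.Scope (values assumed)."""
    K_TILE = "k_tile"
    OUTPUT_TILE = "output_tile"

def partition_epilogue(epilogue_specs):
    """Split op specs by scope; ops without a recognized scope are dropped."""
    specs = list(epilogue_specs or ())
    k_tile = [o for o in specs if getattr(o, "scope", None) == Scope.K_TILE]
    output_tile = [o for o in specs if getattr(o, "scope", None) == Scope.OUTPUT_TILE]
    return k_tile, output_tile

specs = [
    SimpleNamespace(kind="dequant", scope=Scope.K_TILE),
    SimpleNamespace(kind="bias", scope=Scope.OUTPUT_TILE),
    SimpleNamespace(kind="relu"),  # no scope attribute: silently dropped
]
k_ops, out_ops = partition_epilogue(specs)
```

The `relu` spec appears in neither list, which is why the caller has to assign a default scope before handing specs to the generator.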
 ### D5. Math plan stage sequence — `M → N` order
 For each `(m, n)` tile:
 ```
 DMA_READ → FETCH → MATH → STORE → DMA_WRITE
 ```
 There is no K dimension, so concepts like epilogue or accumulator residency
 do not apply. PE_FETCH_STORE's register-file accounting follows the same
 pattern as the GEMM plan.
 ### D6. Plans are data — no SimPy dependency
 `PipelinePlan` is a dataclass in `pe_types.py` holding `tiles:
 list[TilePlan]`. Each `TilePlan` holds `stages: tuple[Stage, ...]`. The plan
 itself is near-immutable (only `Stage.params: dict` is mutable) and holds no
 SimPy objects.
 At runtime, PE_SCHEDULER consumes the plan's first stage, builds a `TileToken`,
 and feeds it into the pipeline. The TileToken carries `plan: TilePlan`,
 `stage_idx: int`, and a cached `params: dict`. Self-routing proceeds by
 `TileToken.advance()` caching the next stage's `params` (ADR-0014 D6).
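A minimal sketch of the data shapes and the self-routing step, assuming the fields named above. The `component_id` field on `Stage`, the boolean return of `advance()`, and the constructor defaults are assumptions made for illustration; the real definitions live in `pe_types.py`.

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    component_id: str                      # hypothetical field name
    params: dict = field(default_factory=dict)

@dataclass
class TilePlan:
    stages: tuple                          # tuple[Stage, ...]

@dataclass
class PipelinePlan:
    tiles: list                            # list[TilePlan]

@dataclass
class TileToken:
    plan: TilePlan
    stage_idx: int = 0
    params: dict = field(default_factory=dict)

    def __post_init__(self):
        # Cache the params of the stage the token starts on.
        self.params = self.plan.stages[self.stage_idx].params

    def advance(self):
        """Step to the next stage and cache its params; False when the plan is done."""
        if self.stage_idx + 1 >= len(self.plan.stages):
            return False
        self.stage_idx += 1
        self.params = self.plan.stages[self.stage_idx].params
        return True

tile = TilePlan(stages=(
    Stage("cube0.pe0.pe_dma", {"op": "read"}),
    Stage("cube0.pe0.pe_gemm", {"op": "gemm"}),
))
plan = PipelinePlan(tiles=[tile])
token = TileToken(plan=plan.tiles[0])
first_params = dict(token.params)
moved = token.advance()
second_params = dict(token.params)
finished = not token.advance()
```

Nothing in these classes references a SimPy environment, so a plan can be built, inspected, and compared in a plain unit test.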
 ### D7. Plan generator contract — pure, deterministic, idempotent
 Two calls with identical inputs return equal-valued `PipelinePlan` results
 (same tiles, same `TilePlan.stages` order). This contract aligns with ADR-0014 D6's
 "deterministic tile dispatch order".
 No side effects (no SimPy events, no file I/O, no global state) — tests can
 call the generators directly without an environment object (some cases in
 `tests/test_pe_pipeline.py` rely on this).
 ## Alternatives Considered
 ### A1. Make tiling a component (e.g., PE_PLANNER)
 Rejected. Plan generation consumes no SimPy time — it is a pure decision
 algorithm. Making it a component would (a) add unnecessary infrastructure
 (inbox, resources), and (b) split PE_SCHEDULER's flow into "receive plan"
 plus "feed tiles", inserting a meaningless hop.
 ### A2. Move plan generation into PE_SCHEDULER as methods
 Rejected (currently). Module separation provides (1) testability and
 (2) extensibility: an additional plan algorithm (e.g., DTensor-aware) is
 just a new function. If plan kinds proliferate enough to require explicit
 dispatch, a future ADR can introduce a plan factory on PE_SCHEDULER.
 ### A3. Make plans fully immutable (frozen dataclass + tuple)
 Partially adopted. `Stage` and `TilePlan` are dataclasses but not frozen,
 because `Stage.params: dict` is populated at plan-generation time and read
 at runtime (cached by TileToken on advance). Moving dict → frozendict pays
 migration cost without enough benefit. Convention: do not mutate after
 generation.
 ## Consequences
 - `tiling.py` is documented as a plan-generator module, not a component —
  preempting future G4-style "this component lacks an ADR" analyses.
 - The GEMM plan's stage sequence (D2) and pinning / epilogue branching
  (D3 / D4) are pinned, providing a clear interpretation basis for sweep
  analyses (e.g., `scripts/gemm_sweep.py`'s stage record counts).
 - The plan generator's pure contract (D7) enables environment-free testing
  in line with ADR-0013 (verification strategy).
 - Future plan kinds (DTensor-aware, K-major, ...) follow D1 / D6 / D7 as a
  baseline — just add a new function.
@@ -0,0 +1,130 @@
 # ADR-0043: Allreduce Evaluation Harness — `tests/sccl/`
 ## Status
 Accepted
 Documents the `tests/sccl/` evaluation harness; verified against the
 implementation (constants, file set, and sweep dimensions cross-checked).
 ## Context
 ADR-0032 defines the intercube all-reduce *algorithm*; ADR-0023/0024/0027
 define the IPCQ backend, the rank=SIP launcher, and `mp.spawn`. None of
 them describe **how the allreduce is exercised and characterized** — the
 correctness tests, the latency/buffer-kind sweeps, and the derived plots.
 ADR-0013 (verification strategy) is the general policy; this ADR pins the
 concrete allreduce harness so the "evaluation" half of the work is
 documented, not just the implementation.
 The harness lives under `tests/sccl/` (the package created when the
 allreduce tests were consolidated). It supersedes the earlier flat
 `tests/test_allreduce_multidevice.py` + `tests/test_distributed_*` layout.
 ## Decision
 ### D1. Drive evaluation through the public `torch.distributed` path
 Correctness and the sweeps run the collective through the real DDP-shaped
 path — `init_process_group(backend="ahbm") → mp.spawn → dist.all_reduce`
 (ADR-0024/0027) — not the lower-level `ctx.launch`. A shared helper
 `_run_distributed(tmp_path, monkeypatch, topo_path, corr_id, n_elem)` in
 `tests/sccl/_allreduce_helpers.py` builds the engine, runs the workers, and
 returns `(engine, n_cubes)`. `monkeypatch.chdir` points the backend's
 `load_ccl_config()` (cwd lookup) at a per-case temp `ccl.yaml`.
 A direct-launch reference (`run_allreduce`) is retained in the same helper
 module — not for the distributed tests, but because the IPCQ buffer-kind /
 root-center micro-tests under `tests/` import it.
 ### D2. One file per evaluation concern
 | File | Concern | `torch.distributed`? |
 |---|---|---|
 | `test_allreduce_ring_torus_mesh.py` | correctness across ring_1d / torus_2d (2×3) / mesh_2d_no_wrap (2×3) | yes |
 | `test_distributed_default_topology.py` | full path on `topology.yaml` as-is | yes |
 | `test_plot_latency_sweep.py` | latency sweep rows (n_elem × topology) | yes |
 | `test_plot_buffer_kind_sweep.py` | TCM/SRAM/HBM sweep rows | yes |
 | `test_plot_topology_diagram.py` | topology.png (pure matplotlib) | no |
 | `test_plot_comparison_fsim.py` | broken-axis model-vs-FSIM comparison | no |
 | `test_intercube_root_center.py` | ADR-0032 center-root latency guard (direct path) | no |
 `_allreduce_helpers.py` holds the shared plumbing (driver, config writers,
 sweep/buffer-kind constants, plot aggregators, topology-diagram + FSIM
 comparison emitters). It is not collected (no `test_` prefix).
 ### D3. Latency metric — critical-path `pe_exec_ns`
 The reported latency per config is `crit_ns = max(pe_exec_ns)` over
 `engine._results` — the slowest rank's PE execution time. This is the
 number plotted on every latency chart and recorded in `summary.csv`.
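In code the metric is a single reduction. The sketch assumes each entry of `engine._results` exposes `pe_exec_ns` as a mapping key, which this ADR does not specify; the sample values are made up.

```python
def critical_path_ns(results):
    """Slowest rank's PE execution time across all per-rank results."""
    return max(r["pe_exec_ns"] for r in results)

# Hypothetical per-rank results for a three-rank run.
results = [
    {"rank": 0, "pe_exec_ns": 1890.5},
    {"rank": 1, "pe_exec_ns": 1964.0},
    {"rank": 2, "pe_exec_ns": 1902.25},
]
crit_ns = critical_path_ns(results)
```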
 ### D4. Sweep dimensions
 - **Latency sweep**: `n_elem ∈ {8, 32, 64, 128, 512, 1024, 2048, 4096,
  8192, 16384, 32768, 49152}` (16 excluded — collides with `n_cubes`) ×
  topology ∈ {ring_1d (6), torus_2d 2×3 (6), mesh_2d_no_wrap 2×3 (6)}.
 - **Buffer-kind sweep**: `buffer_kind ∈ {tcm, sram, hbm}` × a smaller
  `n_elem` grid, on torus_2d 6-SIP (3×2). buffer_kind is set in the temp
  `ccl.yaml` (read by the backend at `init_process_group`, ADR-0023 D6).
 The 2×3 / 3×2 grids exercise the explicit-`w/h` SIP resolution
 (ADR-0024 D5).
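The two grids can be enumerated as below. The constant names are hypothetical (the real ones live in `_allreduce_helpers.py`); the smaller buffer-kind `n_elem` grid uses the values that appear in the committed buffer-kind sweep CSV.

```python
import itertools

# Latency sweep: 16 is deliberately absent (it collides with n_cubes).
N_ELEMS = [8, 32, 64, 128, 512, 1024, 2048, 4096, 8192, 16384, 32768, 49152]
TOPOLOGIES = ["ring_1d", "torus_2d", "mesh_2d_no_wrap"]  # 6 SIPs each

# Buffer-kind sweep on torus_2d with 6 SIPs.
BUFFER_KINDS = ["tcm", "sram", "hbm"]
BUFFER_KIND_N_ELEMS = [128, 1024, 8192, 32768]

latency_cases = list(itertools.product(TOPOLOGIES, N_ELEMS))
buffer_kind_cases = list(itertools.product(BUFFER_KINDS, BUFFER_KIND_N_ELEMS))
```

The resulting 36 and 12 cases match the row counts of the two committed CSV files.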
 ### D5. Derived plots via `pytest_sessionfinish` aggregators
 Sweep tests are xdist-friendly: each parametrized case writes one JSON row
 to a staging dir. The conftest `pytest_sessionfinish` hook (controller node
 only) calls the aggregators in `_allreduce_helpers.py`:
 - `_aggregate_sweep_plots()` → per-topology PNGs + `summary.csv`
 - `aggregate_buffer_kind_plot()` → the TCM/SRAM/HBM comparison PNG + csv
 The topology-diagram and FSIM-comparison figures are emitted directly by
 their own `test_plot_*` tests (no row staging — they are pure functions of
 `topology.yaml` and `summary.csv` respectively). All outputs land in
 `docs/diagrams/allreduce_latency_plots/` and are **derived artifacts** per
 CLAUDE.md (consistent-with-ADRs, no Phase-2 gate).
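The row-staging pattern can be sketched with the standard library alone. The file layout, column subset, and helper names (`write_row`, `aggregate`) are hypothetical and the latency values are illustrative; the worker check relies on pytest-xdist setting `config.workerinput` on worker processes only.

```python
import csv
import json
import tempfile
from pathlib import Path

def write_row(staging_dir, case_id, row):
    """Called by one parametrized case: stage a single JSON row."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    (staging_dir / f"{case_id}.json").write_text(json.dumps(row))

def aggregate(staging_dir, csv_path):
    """Merge all staged rows into one CSV, sorted for stable output."""
    rows = [json.loads(p.read_text()) for p in staging_dir.glob("*.json")]
    rows.sort(key=lambda r: (r["sip_topology"], r["n_elem"]))
    with csv_path.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["sip_topology", "n_elem", "latency_ns"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)

def pytest_sessionfinish(session, exitstatus):
    """conftest hook: aggregate on the controller only, never on xdist workers."""
    if hasattr(session.config, "workerinput"):
        return
    aggregate(Path("staging"), Path("summary.csv"))  # hypothetical paths

# Stand-alone demonstration of the staging and aggregation steps.
with tempfile.TemporaryDirectory() as tmp:
    staging = Path(tmp) / "staging"
    write_row(staging, "torus_2d-128",
              {"sip_topology": "torus_2d", "n_elem": 128, "latency_ns": 1964.04})
    write_row(staging, "ring_1d-128",
              {"sip_topology": "ring_1d", "n_elem": 128, "latency_ns": 2723.69})
    n_rows = aggregate(staging, Path(tmp) / "summary.csv")
    csv_lines = (Path(tmp) / "summary.csv").read_text().splitlines()
```

Writing one file per case avoids any shared state between xdist workers; ordering is imposed once, at aggregation time.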
 ### D6. The FSIM comparison reference is a hardcoded constant
 `emit_comparison_fsim_plot()` overlays the model curves against a single
 external FSIM single-device reference (`366 µs`), held as a literal — there
 is no external data file. The "measured" series comes from the simulator
 (`op_log` GEMM count, `composite_window_ns`); the "theoretical" series is a
 hand-derived analytical model (the same one ADR-0044 D5 flags as
 ADR-unverified).
 ## Consequences
 ### Positive
 - The allreduce is evaluated through the same API a real DDP script uses,
  so the harness doubles as an integration test of ADR-0024/0027.
 - Figures regenerate on every `pytest` run from committed data; no manual
  plot step.
 - Rectangular-grid sweeps gave the regression coverage that surfaced the
  ADR-0024 D5 `w/h` fix.
 ### Negative / limitations
 - The full latency sweep runs in the default `pytest` run (on the order of
   minutes); it is not marked `slow`. (Contrast ADR-0044, where the GEMM
   sweep is `slow`.)
 - `test_intercube_root_center.py` carries a latency *threshold* assertion
  (ADR-0032 center-root guard) — the only absolute-latency assertion in the
  suite; it is sensitive to latency-model changes (ADR-0033).
 ## Dependencies
 - **ADR-0013**: verification strategy (general policy this specializes).
 - **ADR-0023 / ADR-0024 / ADR-0027**: IPCQ backend, rank=SIP launcher,
  `mp.spawn` — the path D1 drives.
 - **ADR-0032**: the algorithm under evaluation; D4 grids exercise its
  topology branches.
 - **ADR-0044**: the sibling GEMM evaluation harness.
 ## Open questions
 - Should the latency sweep be marked `slow` for parity with the GEMM sweep?
 - Should the FSIM reference move from a hardcoded constant to a versioned
  data file?
@@ -0,0 +1,130 @@
 # ADR-0044: GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`
 ## Status
 Accepted
 Documents the GEMM evaluation/characterization harness; verified against the
 implementation (constants, tile sizes, figure set, and the script↔test
 split cross-checked). The D5/D6 caveats are recorded limitations, not
 inaccuracies.
 ## Context
 ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generators) define the GEMM
 *implementation*; ADR-0033 defines the latency model. None of them describe
 **how GEMM performance is swept and characterized** — the shape/variant
 sweep that produces the timing data, and the figures that interpret it.
 This ADR pins that harness.
 Unlike the allreduce harness (ADR-0043), the GEMM sweep is **heavy** (24
 sim runs: 8 shapes × 3 operand-staging variants; the `512³` shape alone is
 16 × 8 × 16 = 2048 tiles at the D4 tile sizes). That weight drives the split below.
 ## Decision
 ### D1. Two-layer split — heavy data generation (script) vs. fast figures (tests)
 - **Data generation stays a manual script**: `scripts/gemm_sweep.py` runs
  `matmul-composite` (ADR-0042 plans) across shapes × variants via the same
  `run_bench` path the CLI uses, harvests `result.engine.op_log`, and
  writes `docs/diagrams/gemm_sweep.json` (per-stage / per-engine wall-clock
  + occupancy + record counts + pe/composite windows).
 - **Figure rendering is test-generated**: `tests/gemm/` reads the committed
  `gemm_sweep.json` and renders matplotlib PNGs into
  `docs/diagrams/gemm_plots/`. These tests are fast and run by default.
 Rationale: a slide-deck-scale sim sweep does not belong in every `pytest`
 run, but the figures (cheap, deterministic) should regenerate freely and be
 guarded by CI. This mirrors CLAUDE.md's script-vs-test split (scripts for
 heavy/manual generation; tests for fast assertions).
 ### D2. Slow regenerator test wraps the script
 `tests/gemm/test_gemm_sweep.py` is marked `@pytest.mark.slow` (excluded by
 the default `addopts: -m "not slow"`). It invokes `scripts/gemm_sweep.py`
 via subprocess to regenerate `gemm_sweep.json` on demand
 (`pytest -m slow tests/gemm/test_gemm_sweep.py`). The sweep logic has a
 single home (the script); the test only wraps it, so there is no duplicated
 sim-driving code.
 ### D3. Figure set (3 charts, `load_ref` variant)
 | Test | PNG | Content |
 |---|---|---|
 | `test_plot_gemm_stage_breakdown.py` | `gemm_stage_breakdown.png` | per-stage engine wall-clock (DMA in / Fetch / GEMM / DMA out) |
 | `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_measured.png` | GEMM util % + useful eff % |
 | `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_theoretical_vs_measured.png` | theoretical vs simulator-measured util/eff |
 `tests/gemm/_gemm_plot_helpers.py` holds the shared renderers (series logic
 mirrors the GEMM `_render_*` functions in `scripts/build_overview_slides.py`,
 which still draws these natively in the PPTX). Not collected (no `test_`
 prefix). Each `test_plot_*` skips if `gemm_sweep.json` is absent.
 ### D4. Tile sizes are data-driven; under-tile shapes are flagged
 Tile sizes are read from `gemm_sweep.json` (`tile_sizes`), which the sweep
 records from `PeSchedulerComponent.TILE_M/K/N = 32/64/32` — the authoritative
 source. Shapes with `M<TILE_M ∨ K<TILE_K ∨ N<TILE_N` are flagged
 ("under-tile") on the charts. The `512³` shape is excluded from the figures
 (`EXCLUDED_SHAPES`).
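A sketch of the under-tile predicate and of the tile count for the `512³` shape. It assumes `tile_sizes` is a mapping with `M` / `K` / `N` keys and that shapes are `(M, K, N)` tuples; the actual `gemm_sweep.json` layout is not specified here, and the sample shapes are illustrative.

```python
from math import ceil

# Values from PeSchedulerComponent.TILE_M/K/N; the dict layout is an assumption.
TILE_SIZES = {"M": 32, "K": 64, "N": 32}

def is_under_tile(shape, tile_sizes=TILE_SIZES):
    """True when any dimension is smaller than its tile size."""
    M, K, N = shape
    return M < tile_sizes["M"] or K < tile_sizes["K"] or N < tile_sizes["N"]

def tile_count(shape, tile_sizes=TILE_SIZES):
    """Total (m, n, k) tiles, using ceiling division per dimension."""
    M, K, N = shape
    return (
        max(1, ceil(M / tile_sizes["M"]))
        * max(1, ceil(K / tile_sizes["K"]))
        * max(1, ceil(N / tile_sizes["N"]))
    )

shapes = [(32, 32, 32), (64, 64, 64), (512, 512, 512)]
flags = {shape: is_under_tile(shape) for shape in shapes}
big = tile_count((512, 512, 512))  # 16 * 8 * 16
```

A `32³` shape is flagged even though `M` and `N` match their tile sizes, because `K = 32` is below `TILE_K = 64`.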
 ### D5. Theoretical model — inherited constants, NOT yet ADR-verified
 The "theoretical" curves use an analytical ideal-pipeline model with
 constants copied verbatim from `scripts/build_overview_slides.py`:
 ```
 HBM_GBS  = 256.0  # GB/s
 T_STAGE  = 16.0   # ns
 D_STAGES = 3
 BPE      = 2
 ```
 **These are not yet sourced against the ADRs.** Notably ADR-0033's `256`
 is `burst_bytes` (256 B), a *different* quantity than this `256 GB/s`, and
 ADR-0033 derives bandwidth as `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`.
 `T_STAGE`/stage-count are not traced to ADR-0014 here. The model is
 therefore **consistent with the existing deck script, not verified against
 the ADRs**, and the constants are duplicated (deck + helper). Reconciling
 them (source from topology/ADR-0033/0014, de-duplicate) is deferred — see
 Open questions.
 ### D6. Known naming caveat — `_measured` chart
 `gemm_mac_utilization_measured.png` currently plots the *theoretical*
 ideal-pipeline numbers (its footnote says so); only the filename says
 "measured". This is a known misnomer pending a decision to either repoint
 its content to the simulator-measured series or retitle it.
 ## Consequences
 ### Positive
 - GEMM figures are test-generated and CI-guarded, like allreduce.
 - The heavy sweep stays opt-in, keeping the default test run fast.
 - Single source for the sweep logic (the script), reused by the slow test.
 ### Negative / limitations
 - The theoretical-model constants (D5) are unverified and duplicated.
 - The `_measured` figure is a misnomer (D6).
 - `build_overview_slides.py` still renders the GEMM bars natively from
  `gemm_sweep.json` rather than embedding these PNGs — the deck rewiring to
  consume the test artifacts is not done.
 ## Dependencies
 - **ADR-0013**: verification strategy.
 - **ADR-0014 / ADR-0042**: PE pipeline + tile-plan generators — the GEMM
   implementation the sweep measures; the stage record counts in D1 come from
   ADR-0042 D2/D3.
 - **ADR-0033**: latency model — the source the D5 constants should (but do
  not yet) trace to.
 - **ADR-0043**: the sibling allreduce evaluation harness.
 ## Open questions
 - Reconcile D5 constants against `topology.yaml` / ADR-0033 / ADR-0014 and
  de-duplicate (one source for the model parameters)?
 - Resolve the D6 `_measured` naming (repoint content vs. retitle)?
 - Rewire `build_overview_slides.py` to embed the `gemm_plots/` PNGs instead
  of native bar-drawing?
@@ -0,0 +1,13 @@
 buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
 hbm,torus_2d,6,128,256,2120.040000000012
 hbm,torus_2d,6,1024,2048,2717.2783333333473
 hbm,torus_2d,6,8192,16384,7315.184999999989
 hbm,torus_2d,6,32768,65536,23081.26500000037
 sram,torus_2d,6,128,256,2060.040000000012
 sram,torus_2d,6,1024,2048,2909.2783333333473
 sram,torus_2d,6,8192,16384,9523.184999999869
 sram,torus_2d,6,32768,65536,32201.265000000385
 tcm,torus_2d,6,128,256,1964.040000000012
 tcm,torus_2d,6,1024,2048,2477.2783333333473
 tcm,torus_2d,6,8192,16384,6403.185000000109
 tcm,torus_2d,6,32768,65536,19865.265000000378
@@ -1,13 +0,0 @@
 buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
 hbm,torus_2d,6,128,256,2120.0399999999754
 hbm,torus_2d,6,1024,2048,2716.74499999995
 hbm,torus_2d,6,8192,16384,7315.185000000081
 hbm,torus_2d,6,32768,65536,23081.265000008738
 sram,torus_2d,6,128,256,2060.0399999999754
 sram,torus_2d,6,1024,2048,2908.74499999995
 sram,torus_2d,6,8192,16384,9523.185000000081
 sram,torus_2d,6,32768,65536,32201.265000008752
 tcm,torus_2d,6,128,256,1964.0399999999754
 tcm,torus_2d,6,1024,2048,2476.74499999995
 tcm,torus_2d,6,8192,16384,6403.185000000081
 tcm,torus_2d,6,32768,65536,19865.265000008738
@@ -1,37 +1,37 @@
 algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
-intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,2666.5524999999725
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,8,16,256,2666.552500000015
-intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,2747.7399999999725
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,32,64,1024,2747.7400000000152
-intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,2855.98999999998
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,64,128,2048,2855.990000000018
-intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,3072.4899999999725
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,128,256,4096,3072.490000000019
-intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3336.579999999951
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3337.1133333333582
-intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3707.49999999992
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3708.0333333333692
-intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4449.339999999875
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4449.873333333393
-intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,5933.020000000055
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,5933.020000000124
-intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,8900.380000000157
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,8900.379999999863
-intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,14835.099999997583
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,14835.099999999224
-intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,26704.540000017492
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,26704.540000000765
-intercube_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,38573.980000026335
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,38573.97999999701
-intercube_allreduce,ring_1d,6,8,16,256,2365.2558333333036
+lrab_hierarchical_allreduce,ring_1d,6,8,16,256,2365.255833333347
-intercube_allreduce,ring_1d,6,32,64,1024,2436.9433333333036
+lrab_hierarchical_allreduce,ring_1d,6,32,64,1024,2436.9433333333473
-intercube_allreduce,ring_1d,6,64,128,2048,2532.526666666643
+lrab_hierarchical_allreduce,ring_1d,6,64,128,2048,2532.526666666683
-intercube_allreduce,ring_1d,6,128,256,4096,2723.6933333333036
+lrab_hierarchical_allreduce,ring_1d,6,128,256,4096,2723.693333333349
-intercube_allreduce,ring_1d,6,512,1024,16384,3042.0349999999544
+lrab_hierarchical_allreduce,ring_1d,6,512,1024,16384,3048.635000000021
-intercube_allreduce,ring_1d,6,1024,2048,32768,3390.201666666597
+lrab_hierarchical_allreduce,ring_1d,6,1024,2048,32768,3393.4016666666957
-intercube_allreduce,ring_1d,6,2048,4096,65536,4079.7349999998714
+lrab_hierarchical_allreduce,ring_1d,6,2048,4096,65536,4082.401666666714
-intercube_allreduce,ring_1d,6,4096,8192,131072,5458.801666666721
+lrab_hierarchical_allreduce,ring_1d,6,4096,8192,131072,5458.80166666677
-intercube_allreduce,ring_1d,6,8192,16384,262144,8216.93500000014
+lrab_hierarchical_allreduce,ring_1d,6,8192,16384,262144,8216.934999999943
-intercube_allreduce,ring_1d,6,16384,32768,524288,13733.201666664638
+lrab_hierarchical_allreduce,ring_1d,6,16384,32768,524288,13733.201666665835
-intercube_allreduce,ring_1d,6,32768,65536,1048576,24765.735000014545
+lrab_hierarchical_allreduce,ring_1d,6,32768,65536,1048576,24765.73500000064
-intercube_allreduce,ring_1d,6,49152,98304,1572864,35798.268333355256
+lrab_hierarchical_allreduce,ring_1d,6,49152,98304,1572864,35798.268333331536
-intercube_allreduce,torus_2d,6,8,16,256,1700.6024999999754
+lrab_hierarchical_allreduce,torus_2d,6,8,16,256,1700.6025000000095
-intercube_allreduce,torus_2d,6,32,64,1024,1753.2899999999754
+lrab_hierarchical_allreduce,torus_2d,6,32,64,1024,1753.2900000000102
-intercube_allreduce,torus_2d,6,64,128,2048,1823.539999999979
+lrab_hierarchical_allreduce,torus_2d,6,64,128,2048,1823.540000000012
-intercube_allreduce,torus_2d,6,128,256,4096,1964.0399999999754
+lrab_hierarchical_allreduce,torus_2d,6,128,256,4096,1964.040000000012
-intercube_allreduce,torus_2d,6,512,1024,16384,2196.2849999999653
+lrab_hierarchical_allreduce,torus_2d,6,512,1024,16384,2196.8183333333463
-intercube_allreduce,torus_2d,6,1024,2048,32768,2476.74499999995
+lrab_hierarchical_allreduce,torus_2d,6,1024,2048,32768,2477.2783333333473
-intercube_allreduce,torus_2d,6,2048,4096,65536,3037.664999999919
+lrab_hierarchical_allreduce,torus_2d,6,2048,4096,65536,3038.1983333333583
-intercube_allreduce,torus_2d,6,4096,8192,131072,4159.50500000003
+lrab_hierarchical_allreduce,torus_2d,6,4096,8192,131072,4159.5050000000665
-intercube_allreduce,torus_2d,6,8192,16384,262144,6403.185000000081
+lrab_hierarchical_allreduce,torus_2d,6,8192,16384,262144,6403.185000000109
-intercube_allreduce,torus_2d,6,16384,32768,524288,10890.544999998769
+lrab_hierarchical_allreduce,torus_2d,6,16384,32768,524288,10890.5449999995
-intercube_allreduce,torus_2d,6,32768,65536,1048576,19865.265000008738
+lrab_hierarchical_allreduce,torus_2d,6,32768,65536,1048576,19865.265000000378
-intercube_allreduce,torus_2d,6,49152,98304,1572864,28839.985000013185
+lrab_hierarchical_allreduce,torus_2d,6,49152,98304,1572864,28839.98500000059
@@ -1,81 +1,81 @@
 hop,label,size_bytes,path,total_ns
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,ipcq,24.88749999999891
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),128,ipcq,24.88749999999891
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,raw,33.57999999999811
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),128,raw,33.57999999999811
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,ipcq,28.13749999999891
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),256,ipcq,28.13749999999891
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,raw,36.07999999999811
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),256,raw,36.07999999999811
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,ipcq,29.88749999999891
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),384,ipcq,29.88749999999891
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,raw,37.07999999999811
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),384,raw,37.07999999999811
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,ipcq,31.63749999999891
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),512,ipcq,31.63749999999891
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,raw,38.07999999999811
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),512,raw,38.07999999999811
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,ipcq,35.13749999999891
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),768,ipcq,35.13749999999891
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,raw,40.07999999999811
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),768,raw,40.07999999999811
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,ipcq,38.63749999999891
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),1024,ipcq,38.63749999999891
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,raw,42.07999999999811
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),1024,raw,42.07999999999811
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,ipcq,52.63749999999891
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),2048,ipcq,52.63749999999891
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,raw,50.07999999999811
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),2048,raw,50.07999999999811
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,ipcq,80.63750000000073
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),4096,ipcq,80.63750000000073
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,raw,66.08000000000175
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),4096,raw,66.08000000000175
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,ipcq,136.63750000000073
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),8192,ipcq,136.63750000000073
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,raw,98.08000000000175
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),8192,raw,98.08000000000175
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,ipcq,164.63750000000073
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),10240,ipcq,164.63750000000073
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,raw,114.08000000000175
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),10240,raw,114.08000000000175
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,ipcq,38.49749999999585
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),128,ipcq,38.49749999999585
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,raw,47.18999999999505
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),128,raw,47.18999999999505
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,ipcq,43.24749999999585
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),256,ipcq,43.24749999999585
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,raw,51.18999999999505
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),256,raw,51.18999999999505
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,ipcq,44.99749999999585
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),384,ipcq,44.99749999999585
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,raw,52.18999999999505
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),384,raw,52.18999999999505
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,ipcq,46.74749999999585
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),512,ipcq,46.74749999999585
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,raw,53.18999999999505
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),512,raw,53.18999999999505
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,ipcq,50.24749999999585
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),768,ipcq,50.24749999999585
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,raw,55.18999999999505
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),768,raw,55.18999999999505
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,ipcq,53.74749999999585
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),1024,ipcq,53.74749999999585
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,raw,57.18999999999505
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),1024,raw,57.18999999999505
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,ipcq,67.74749999999585
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),2048,ipcq,67.74749999999585
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,raw,65.18999999999505
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),2048,raw,65.18999999999505
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,ipcq,95.74750000000131
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),4096,ipcq,95.74750000000131
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,raw,81.19000000000233
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),4096,raw,81.19000000000233
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,ipcq,151.7475000000013
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),8192,ipcq,151.7475000000013
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,raw,113.19000000000233
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),8192,raw,113.19000000000233
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,ipcq,179.7475000000013
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),10240,ipcq,179.7475000000013
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,raw,129.19000000000233
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),10240,raw,129.19000000000233
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,ipcq,81.15999999999804
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),128,ipcq,81.15999999999804
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,raw,89.28999999999724
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),128,raw,89.28999999999724
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,ipcq,88.65999999999804
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),256,ipcq,88.65999999999804
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,raw,95.53999999999724
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),256,raw,95.53999999999724
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,ipcq,90.90999999999804
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),384,ipcq,90.90999999999804
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,raw,96.53999999999724
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),384,raw,96.53999999999724
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,ipcq,93.15999999999804
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),512,ipcq,93.15999999999804
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,raw,97.53999999999724
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),512,raw,97.53999999999724
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,ipcq,97.65999999999804
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),768,ipcq,97.65999999999804
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,raw,99.53999999999724
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),768,raw,99.53999999999724
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,ipcq,103.15999999999804
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),1024,ipcq,103.15999999999804
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,raw,102.53999999999724
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),1024,raw,102.53999999999724
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,ipcq,125.15999999999804
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),2048,ipcq,125.15999999999804
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,raw,114.53999999999724
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),2048,raw,114.53999999999724
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,ipcq,169.15999999999985
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),4096,ipcq,169.15999999999985
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,raw,138.54000000000087
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),4096,raw,138.54000000000087
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,ipcq,257.15999999999985
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),8192,ipcq,257.15999999999985
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,raw,186.54000000000087
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),8192,raw,186.54000000000087
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,ipcq,301.15999999999985
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),10240,ipcq,301.15999999999985
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,raw,210.54000000000087
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),10240,raw,210.54000000000087
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,ipcq,103.15999999999804
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),128,ipcq,103.15999999999804
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,raw,111.28999999999724
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),128,raw,111.28999999999724
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,ipcq,112.65999999999804
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),256,ipcq,112.65999999999804
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,raw,119.53999999999724
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),256,raw,119.53999999999724
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,ipcq,114.90999999999804
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),384,ipcq,114.90999999999804
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,raw,120.53999999999724
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),384,raw,120.53999999999724
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,ipcq,117.15999999999804
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),512,ipcq,117.15999999999804
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,raw,121.53999999999724
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),512,raw,121.53999999999724
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,ipcq,121.65999999999804
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),768,ipcq,121.65999999999804
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,raw,123.53999999999724
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),768,raw,123.53999999999724
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,ipcq,127.15999999999804
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),1024,ipcq,127.15999999999804
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,raw,126.53999999999724
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),1024,raw,126.53999999999724
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,ipcq,149.15999999999804
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),2048,ipcq,149.15999999999804
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,raw,138.53999999999724
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),2048,raw,138.53999999999724
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,ipcq,193.15999999999985
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),4096,ipcq,193.15999999999985
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,raw,162.54000000000087
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),4096,raw,162.54000000000087
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,ipcq,281.15999999999985
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),8192,ipcq,281.15999999999985
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,raw,210.54000000000087
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),8192,raw,210.54000000000087
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,ipcq,325.15999999999985
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),10240,ipcq,325.15999999999985
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,raw,234.54000000000087
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),10240,raw,234.54000000000087
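The rows renamed above carry the same numbers as before; only the case id and label change. As a reading aid, here is a minimal sketch that finds the smallest payload at which the `raw` path becomes faster than `ipcq` for one case. The header line (`case_id,label,bytes,kind,latency`) is an assumption, since the CSV header is not visible in this hunk; the four data rows are copied verbatim from it.

```python
import csv
import io

# Header names are assumed for illustration; the data rows are copied from the diff.
SAMPLE = """\
case_id,label,bytes,kind,latency
latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),1024,ipcq,53.74749999999585
latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),1024,raw,57.18999999999505
latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),2048,ipcq,67.74749999999585
latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),2048,raw,65.18999999999505
"""


def first_raw_win(text, case_id):
    """Return the smallest payload (bytes) where raw beats ipcq, or None."""
    by_size = {}
    for row in csv.DictReader(io.StringIO(text)):
        if row["case_id"] != case_id:
            continue
        by_size.setdefault(int(row["bytes"]), {})[row["kind"]] = float(row["latency"])
    for size in sorted(by_size):
        pair = by_size[size]
        if "raw" in pair and "ipcq" in pair and pair["raw"] < pair["ipcq"]:
            return size
    return None


crossover = first_raw_win(SAMPLE, "latency_intracube_PE0_to_PE4_vertical")
print(crossover)
```

Keying on the case id rather than the label keeps such a script stable if the human-readable labels are reworded again.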
@@ -12,8 +12,8 @@ dependencies = ["pytest", "simpy", "pyyaml", "fastapi>=0.110", "uvicorn[standard
 kernbench = "kernbench.cli.main:main"
 [tool.setuptools.packages.find]
-where = ["src", "."]
+where = ["src"]
-include = ["kernbench*", "benches*"]
+include = ["kernbench*"]
 [project.optional-dependencies]
 dev = [
@@ -4,8 +4,8 @@ Slides:
  1. Overall architecture — how PEs are connected (cube_mesh_view)
  2. Model correctness — DMA vs P2P latency (pe2pe overview)
  3. PE-to-PE IPCQ communication (ipcq_two_pe_dma)
-  4. 6-device allreduce — model vs theoretical vs ext-sim (overview_broken)
+  4. 6-device allreduce — model vs theoretical vs FSIM (comparison_…_fsim)
-  5. IPCQ buffer-kind sweep — TCM vs SRAM vs HBM (buffer_kind_sweep)
+  5. IPCQ buffer-kind sweep — TCM vs SRAM vs HBM (…_with_TCM_SRAM_HBM)
  6. PE_accelerator data path (composite GEMM pipeline structure)
  7. matmul(32, 128, 32) — composite GEMM execution sequence
  8. matmul(32, 128, 128) — pipeline scaling and HBM contention
@@ -63,7 +63,7 @@ SLIDES = [
    },
    {
        "title": "4. 6-Device Allreduce: Model vs Theoretical vs External Simulator",
-        "image": DIAG / "allreduce_latency_plots" / "overview_broken.png",
+        "image": DIAG / "allreduce_latency_plots" / "comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png",
        "bullets": [
            "Three SIP topologies (ring / torus / mesh) swept 16 B → 96 KB per PE",
            "Dashed red curve: hand-derived theoretical model for torus_2d (6 SIPs)",
@@ -73,7 +73,7 @@ SLIDES = [
    },
    {
        "title": "5. IPCQ Slot Memory: TCM vs SRAM vs HBM",
-        "image": DIAG / "allreduce_latency_plots" / "buffer_kind_sweep.png",
+        "image": DIAG / "allreduce_latency_plots" / "AllReduce_LRAB_2Dtorus_6SiP_2x3_with_TCM_SRAM_HBM.png",
        "bullets": [
            "Same allreduce with slot memory swapped: TCM (per-PE local) / SRAM / HBM (cube-shared, behind router link)",
            "Cost = NoC drain + slot-IO + PE↔bank hop; only TCM skips the bank hop",
@@ -1,192 +0,0 @@
 """One-shot: render overview.png with an external 366 µs reference, in two
 variants — log scale and broken y-axis. Reads docs/diagrams/allreduce_latency_plots/summary.csv
 and writes overview_log.png and overview_broken.png alongside it.
 This is a derived-artifact generator (per CLAUDE.md): plotting only, no production
 or test logic touched.
 """
 from __future__ import annotations
 import csv
 from pathlib import Path
 import matplotlib.pyplot as plt
 import matplotlib.ticker as mticker
 ROOT = Path(__file__).resolve().parent.parent
 PLOT_DIR = ROOT / "docs" / "diagrams" / "allreduce_latency_plots"
 CSV_PATH = PLOT_DIR / "summary.csv"
 EXT_LABEL = "ext-sim single-device reduce: 366 µs"
 EXT_LATENCY_NS = 366_000.0
 COLORS = {
    "ring_1d": "tab:blue",
    "torus_2d": "tab:orange",
    "mesh_2d_no_wrap": "tab:green",
 }
 # Hand-derived theoretical model for torus_2d (6 SIPs). Mirrors
 # _aggregate_sweep_plots in tests/test_allreduce_multidevice.py.
 NOC_PACKET_BYTES = 128
 PES_PER_CUBE = 8
 T_STARTUP_NS = 1346.0
 TAU_NS = (8741.0 - 1346.0) / (6144 - 1)
 def _theoretical_torus_2d_ns(bytes_per_pe: int) -> float:
    bytes_per_cube = int(bytes_per_pe) * PES_PER_CUBE
    n_packets = max(1, -(-bytes_per_cube // NOC_PACKET_BYTES))
    return T_STARTUP_NS + (n_packets - 1) * TAU_NS
 def _plot_theoretical(ax, records):
    torus_rs = sorted(
        [r for r in records if r["sip_topology"] == "torus_2d"],
        key=lambda r: r["bytes_per_pe"],
    )
    if not torus_rs:
        return
    ax.plot(
        [r["bytes_per_pe"] for r in torus_rs],
        [_theoretical_torus_2d_ns(r["bytes_per_pe"]) for r in torus_rs],
        color="tab:red", linestyle="--", linewidth=1.6, marker="x",
        label="theoretical torus_2d (6 SIPs)",
    )
 def _bytes_fmt(x, _pos):
    if x >= 1024 * 1024:
        return f"{x / (1024 * 1024):.0f}M"
    if x >= 1024:
        return f"{x / 1024:.0f}K"
    return f"{int(x)}"
 def _load_records():
    rows = []
    with open(CSV_PATH, newline="") as f:
        r = csv.DictReader(f)
        for row in r:
            rows.append({
                "sip_topology": row["sip_topology"],
                "bytes_per_pe": int(row["bytes_per_pe"]),
                "latency_ns": float(row["latency_ns"]),
            })
    return rows
 def _ext_x(records):
    """Anchor the external reference at the largest payload (96 KB / PE)."""
    return max(r["bytes_per_pe"] for r in records)
 def _plot_curves(ax, records, topologies):
    for topo in topologies:
        rs = sorted([r for r in records if r["sip_topology"] == topo],
                    key=lambda r: r["bytes_per_pe"])
        if not rs:
            continue
        ax.plot(
            [r["bytes_per_pe"] for r in rs],
            [r["latency_ns"] for r in rs],
            marker="o",
            label=f"{topo}",
            color=COLORS.get(topo),
        )
 def emit_log(records):
    topologies = sorted({r["sip_topology"] for r in records})
    fig, ax = plt.subplots(figsize=(9, 6))
    _plot_curves(ax, records, topologies)
    _plot_theoretical(ax, records)
    ax.scatter(
        [_ext_x(records)], [EXT_LATENCY_NS],
        marker="*", s=220, color="tab:red", zorder=5,
        label=EXT_LABEL,
    )
    ax.set_xscale("log", base=2)
    ax.set_yscale("log")
    ax.set_xlabel("Bytes per PE (log scale)")
    ax.set_ylabel("Time (ns) — log scale")
    ax.set_title("Multi-device allreduce latency vs external single-device reference")
    ax.grid(True, which="both", alpha=0.3)
    ax.xaxis.set_major_formatter(mticker.FuncFormatter(_bytes_fmt))
    ax.legend(loc="upper left")
    fig.tight_layout()
    out = PLOT_DIR / "overview_log.png"
    fig.savefig(out, dpi=120)
    plt.close(fig)
    print(f"wrote {out}")
 def emit_broken(records):
    topologies = sorted({r["sip_topology"] for r in records})
    max_local = max(r["latency_ns"] for r in records)
    fig, (ax_top, ax_bot) = plt.subplots(
        2, 1, sharex=True,
        gridspec_kw={"height_ratios": [1, 4], "hspace": 0.05},
        figsize=(9, 6.5),
    )
    # Bottom panel: today's three curves + theoretical, linear y.
    _plot_curves(ax_bot, records, topologies)
    _plot_theoretical(ax_bot, records)
    ax_bot.set_ylim(0, max_local * 1.10)
    # Top panel: only the external reference marker, linear y around 366 µs.
    ax_top.scatter(
        [_ext_x(records)], [EXT_LATENCY_NS],
        marker="*", s=240, color="tab:red", zorder=5,
        label=EXT_LABEL,
    )
    ax_top.set_ylim(EXT_LATENCY_NS * 0.93, EXT_LATENCY_NS * 1.05)
    # Hide the spine between the two panels and draw diagonal "break" ticks.
    ax_top.spines["bottom"].set_visible(False)
    ax_bot.spines["top"].set_visible(False)
    ax_top.tick_params(labeltop=False, bottom=False)
    ax_bot.xaxis.tick_bottom()
    d = 0.012  # diagonal-tick size, in axis-fraction
    kw = dict(transform=ax_top.transAxes, color="k", clip_on=False, lw=1)
    ax_top.plot((-d, +d), (-d, +d), **kw)
    ax_top.plot((1 - d, 1 + d), (-d, +d), **kw)
    kw.update(transform=ax_bot.transAxes)
    ax_bot.plot((-d, +d), (1 - d * 4, 1 + d * 4), **kw)
    ax_bot.plot((1 - d, 1 + d), (1 - d * 4, 1 + d * 4), **kw)
    ax_bot.set_xscale("log", base=2)
    ax_bot.set_xlabel("Bytes per PE (log scale)")
    ax_bot.set_ylabel("Time (ns)")
    ax_top.set_ylabel("Time (ns)")
    ax_bot.grid(True, alpha=0.3)
    ax_top.grid(True, alpha=0.3)
    ax_bot.xaxis.set_major_formatter(mticker.FuncFormatter(_bytes_fmt))
    # One legend covering both axes.
    handles_bot, labels_bot = ax_bot.get_legend_handles_labels()
    handles_top, labels_top = ax_top.get_legend_handles_labels()
    ax_bot.legend(handles_bot + handles_top, labels_bot + labels_top,
                  loc="upper left")
    fig.suptitle("Multi-device allreduce latency vs external single-device reference (broken y-axis)")
    fig.tight_layout()
    out = PLOT_DIR / "overview_broken.png"
    fig.savefig(out, dpi=120)
    plt.close(fig)
    print(f"wrote {out}")
 def main():
    records = _load_records()
    if not records:
        raise SystemExit(f"no rows in {CSV_PATH}")
    emit_log(records)
    emit_broken(records)
 if __name__ == "__main__":
    main()
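For reference, the theoretical curve drawn by the deleted script is a linear packet-count model: a startup term plus a per-packet term for every packet after the first. The sketch below re-evaluates it standalone, with constants copied from the script above; the helper name without the leading underscore and the chosen sample sizes are illustrative, and where the constants originally come from is not stated in this diff.

```python
# Constants copied from the deleted plotting script.
NOC_PACKET_BYTES = 128
PES_PER_CUBE = 8
T_STARTUP_NS = 1346.0
TAU_NS = (8741.0 - 1346.0) / (6144 - 1)


def theoretical_torus_2d_ns(bytes_per_pe: int) -> float:
    """T = T_startup + (n_packets - 1) * tau, with n_packets = ceil(cube bytes / 128)."""
    bytes_per_cube = int(bytes_per_pe) * PES_PER_CUBE
    # -(-a // b) is integer ceiling division; at least one packet is always sent.
    n_packets = max(1, -(-bytes_per_cube // NOC_PACKET_BYTES))
    return T_STARTUP_NS + (n_packets - 1) * TAU_NS


# Illustrative payloads: the smallest and largest sizes of the sweep plus one midpoint.
for size in (16, 1024, 96 * 1024):
    print(size, round(theoretical_torus_2d_ns(size), 1))
```

By construction the two ends of the sweep land on the constants themselves: 16 B per PE is a single 128 B packet (1346 ns), and 96 KB per PE is 6144 packets (8741 ns). Every intermediate point is a straight-line interpolation in packet count.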
@@ -117,19 +117,19 @@ def _run_one(M: int, K: int, N: int, topology: str, variant: str = "ref_ref") ->
    os.environ["MATMUL_N"] = str(N)
    os.environ["MATMUL_VARIANT"] = variant
-    # Late imports so env vars are read by benches/matmul_composite at module load.
+    # Late imports so env vars are read by matmul_composite at module load.
    # Force re-import to pick up new env values.
-    for mod_name in [m for m in list(sys.modules) if m.startswith("benches.matmul_composite")]:
+    for mod_name in [m for m in list(sys.modules) if m.startswith("kernbench.benches.matmul_composite")]:
        del sys.modules[mod_name]
-    from benches.loader import resolve_bench
+    from kernbench.benches.registry import resolve as resolve_bench
    from kernbench.runtime_api.bench_runner import run_bench
    from kernbench.runtime_api.types import resolve_device
    from kernbench.sim_engine.engine import GraphEngine
    from kernbench.topology.builder import resolve_topology
    topo = resolve_topology(topology)
-    bench = resolve_bench("matmul_composite")
+    bench = resolve_bench("matmul-composite").run
    device = resolve_device(None)
    t0 = time.time()
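The hunk above relies on a module that reads its configuration from environment variables at import time, so each sweep point has to evict the cached module before importing it again. A minimal, self-contained sketch of that pattern follows; `demo_env_bench` and `DEMO_M` are hypothetical stand-ins, not names from this repository.

```python
import importlib
import os
import sys
import tempfile
from pathlib import Path

# Hypothetical module that reads its config at import time, like a bench module might.
MODULE_SRC = 'import os\nM = int(os.environ.get("DEMO_M", "256"))\n'


def load_with_env(value: str) -> int:
    os.environ["DEMO_M"] = value
    # Evict cached entries so the next import re-executes module-level code.
    for name in [m for m in list(sys.modules) if m.startswith("demo_env_bench")]:
        del sys.modules[name]
    return importlib.import_module("demo_env_bench").M


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_env_bench.py").write_text(MODULE_SRC)
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        first = load_with_env("32")
        second = load_with_env("128")
    finally:
        # Leave the interpreter state as it was found.
        sys.path.remove(tmp)
        sys.modules.pop("demo_env_bench", None)
        os.environ.pop("DEMO_M", None)

print(first, second)
```

Without the eviction step the second import would return the cached module and `M` would stay at its first value, which is why the prefix passed to `startswith` has to follow the package rename.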
@@ -0,0 +1,9 @@
 """kernbench.benches: eager-import sibling modules so @bench fires.
 Underscore-prefixed modules are treated as helpers and skipped.
 After import, every imported module must have registered at least one
 bench, or a RuntimeError is raised by the audit.
 """
 from kernbench.benches.registry import _eager_import_and_audit
 _eager_import_and_audit(__path__, __name__)
@@ -14,6 +14,7 @@ from dataclasses import dataclass
 import numpy as np
 from kernbench.benches.registry import bench
 from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
 from kernbench.policy.placement.dp import DPPolicy
@@ -95,6 +96,10 @@ def _worker(rank: int, cfg: _BenchCfg, torch) -> None:
        _report(tensor.numpy(), cfg)
@bench(
    name="ccl-allreduce",
    description="CCL all-reduce bench (TP launcher; rank = SIP).",
 )
 def run(torch) -> None:
    torch.distributed.init_process_group(backend="ahbm")
    cfg = _resolve_cfg(torch)
@@ -10,6 +10,7 @@ per-tile DMA internally.
 Run:
    kernbench run gemm_single_pe
 """
 from kernbench.benches.registry import bench
 from kernbench.policy.placement.dp import DPPolicy
 # GEMM dimensions: (M, K) x (K, N) → (M, N)
@@ -27,6 +28,10 @@ def _gemm_kernel(a_ptr, b_ptr, out_ptr, M, K, N, tl, DTYPE="f16"):
    tl.wait(h)
@bench(
    name="gemm-single-pe",
    description="Single-PE GEMM via scheduler_v2 (pe_accel).",
 )
 def run(torch):
    """Run the single-PE GEMM benchmark."""
    dp = DPPolicy(cube="replicate", pe="replicate",
@@ -20,6 +20,7 @@ topology.yaml is unchanged.
 Run:
    kernbench run gpt3_qkv
 """
 from kernbench.benches.registry import bench
 from kernbench.policy.placement.dp import DPPolicy
 # -- PE configuration (DPPolicy overrides — does not change topology.yaml) -----
@@ -66,6 +67,10 @@ def _gpt3_qkv_kernel(x_ptr, wq_ptr, wk_ptr, wv_ptr,
    tl.wait(hv)
@bench(
    name="gpt3-qkv",
    description="GPT-3 QKV projection sharded column-wise across all PEs.",
 )
 def run(torch):
    """Run the GPT-3 QKV benchmark."""
    M = SEQ_LEN
@@ -0,0 +1,9 @@
 from kernbench.benches.registry import bench
@bench(
    name="ipcq-allreduce",
    description="IPCQ all-reduce kernel bench (placeholder).",
 )
 def run(torch):
    print("IPCQ all reduce kernel bench")
@@ -17,6 +17,7 @@ Run:
 """
 import os
 from kernbench.benches.registry import bench
 from kernbench.policy.placement.dp import DPPolicy
 M = int(os.environ.get("MATMUL_M", "256"))
@@ -57,6 +58,10 @@ _KERNELS = {
 }
@bench(
    name="matmul-composite",
    description="Single-PE composite GEMM with ref/load variants for perf characterization.",
 )
 def run(torch):
    if VARIANT not in _KERNELS:
        raise ValueError(f"unknown MATMUL_VARIANT={VARIANT!r}; "
@@ -7,6 +7,7 @@ Kernel: tl.load(a) + tl.ref(b) + tl.composite(gemm) + tl.wait()
  - Tensor a is loaded into TCM via DMA
  - Tensor b stays in HBM; PE_SCHEDULER streams it per-tile (32x64x32)
 """
 from kernbench.benches.registry import bench
 from kernbench.policy.placement.dp import DPPolicy
 # GEMM dimensions: (M, K) x (K, N) → (M, N)
@@ -28,6 +29,10 @@ def _gemm_kernel(a_ptr, b_ptr, out_ptr, M, K, N, tl, DTYPE="f16"):
    tl.wait(handle)
@bench(
    name="qkv-gemm",
    description="QKV GEMM (Q*K^T) on a single PE — full host-to-PE pipeline.",
 )
 def run(torch):
    """Run the QKV GEMM benchmark."""
    # DP placement: a=replicate (cube-level), b/out=column_wise (N-axis, single PE)
@@ -7,6 +7,7 @@ Kernel: tl.load(a) + tl.ref(b) + tl.composite(gemm) + tl.wait()
  - Tensor a is loaded into TCM via DMA
  - Tensor b stays in HBM; PE_SCHEDULER streams it per-tile (32x64x32)
 """
 from kernbench.benches.registry import bench
 from kernbench.policy.placement.dp import DPPolicy
 # GEMM dimensions: (M, K) x (K, N) -> (M, N)
@@ -28,6 +29,10 @@ def _gemm_kernel(a_ptr, b_ptr, out_ptr, M, K, N, tl, DTYPE="f16"):
    tl.wait(handle)
@bench(
    name="qkv-gemm-multi-pe",
    description="Column-parallel QKV GEMM across all PEs in a cube (multi-PE).",
 )
 def run(torch):
    """Run the multi-PE QKV GEMM benchmark."""
    # DP placement: a=replicate (cube-level), b/out=column_wise (N-axis split)
@@ -0,0 +1,106 @@
 """Bench registry: @bench decorator + name/index resolution.
 Each bench module under ``kernbench.benches`` MUST register its callable
 via ``@bench(name=..., description=...)``. Indices are assigned
 alphabetically by name after eager import; they are a CLI convenience,
 not a stable API.
 """
 from __future__ import annotations
 import re
 from collections.abc import Callable
 from dataclasses import dataclass
 from importlib import import_module
 from pkgutil import iter_modules
 BenchFn = Callable[..., object]
 _NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")
@dataclass(frozen=True)
 class BenchSpec:
    index: int
    name: str
    description: str
    run: BenchFn
 _PENDING: list[tuple[str, str, BenchFn]] = []
 _REGISTERED_MODULES: set[str] = set()
 _REGISTRY: dict[str, BenchSpec] = {}
 def bench(*, name: str, description: str) -> Callable[[BenchFn], BenchFn]:
    if not isinstance(name, str) or not _NAME_RE.match(name):
        raise ValueError(
            f"bench name {name!r} must be kebab-case (lowercase, digits, dashes; "
            f"starts with a letter)."
        )
    if not isinstance(description, str) or not description.strip():
        raise ValueError(f"bench {name!r}: description must be a non-empty string.")
    def deco(fn: BenchFn) -> BenchFn:
        _PENDING.append((name, description, fn))
        _REGISTERED_MODULES.add(fn.__module__)
        return fn
    return deco
 def _finalize() -> None:
    if _REGISTRY:
        return
    seen: set[str] = set()
    for n, _, _ in _PENDING:
        if n in seen:
            raise RuntimeError(f"duplicate bench name: {n!r}")
        seen.add(n)
    for i, (n, d, f) in enumerate(sorted(_PENDING, key=lambda t: t[0]), start=1):
        _REGISTRY[n] = BenchSpec(index=i, name=n, description=d, run=f)
 def list_all() -> list[BenchSpec]:
    _finalize()
    return sorted(_REGISTRY.values(), key=lambda s: s.index)
 def resolve(identifier: str) -> BenchSpec:
    _finalize()
    if not isinstance(identifier, str) or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    ident = identifier.strip()
    if ident.isdigit():
        idx = int(ident)
        for s in _REGISTRY.values():
            if s.index == idx:
                return s
        raise ValueError(
            f"No bench with index {idx}. Use 'kernbench list' to see options."
        )
    if ident in _REGISTRY:
        return _REGISTRY[ident]
    raise ValueError(
        f"Unknown bench {ident!r}. Use 'kernbench list' to see options."
    )
 def _audit_modules(imported: list[str], registered: set[str]) -> None:
    missing = sorted(m for m in imported if m not in registered)
    if missing:
        raise RuntimeError(
            f"Bench module(s) missing @bench decorator: {missing}. "
            f"Each file under kernbench.benches/ must register at least one bench "
            f"via @bench(...), or be renamed with a leading underscore if it is a "
            f"helper."
        )
 def _eager_import_and_audit(pkg_path: list[str], pkg_name: str) -> None:
    imported: list[str] = []
    for m in iter_modules(pkg_path):
        if m.name == "registry" or m.name.startswith("_"):
            continue
        mod = import_module(f"{pkg_name}.{m.name}")
        imported.append(mod.__name__)
    _audit_modules(imported, _REGISTERED_MODULES)
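To make the registry's contract concrete, here is a simplified standalone sketch of the same idea: kebab-case validation at decoration time, alphabetical index assignment, and lookup by name or by index. It is not the module above; `Spec`, `_pending`, `registry`, and the two toy bench functions are illustrative names, and the description field, duplicate check, and module audit are left out for brevity.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Same pattern as _NAME_RE in the registry module above.
NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


@dataclass(frozen=True)
class Spec:
    index: int
    name: str
    run: Callable[..., object]


_pending = []  # (name, fn) pairs collected while modules are imported


def bench(*, name: str):
    if not NAME_RE.match(name):
        raise ValueError(f"bench name {name!r} must be kebab-case")

    def deco(fn):
        _pending.append((name, fn))
        return fn

    return deco


@bench(name="qkv-gemm")
def run_qkv(torch=None):
    return "qkv"


@bench(name="ccl-allreduce")
def run_ccl(torch=None):
    return "ccl"


# Indices follow alphabetical order of names, starting at 1.
registry = {
    n: Spec(index=i, name=n, run=f)
    for i, (n, f) in enumerate(sorted(_pending, key=lambda t: t[0]), start=1)
}


def resolve(identifier: str) -> Spec:
    ident = identifier.strip()
    if ident.isdigit():
        for spec in registry.values():
            if spec.index == int(ident):
                return spec
        raise ValueError(f"No bench with index {ident}")
    if ident in registry:
        return registry[ident]
    raise ValueError(f"Unknown bench {ident!r}")


print(resolve("1").name, resolve("qkv-gemm").index)
```

Because indices depend on the sorted set of names, adding a bench can shift them; scripts are better off passing names, as the sweep script does with `"matmul-composite"`. Note also that underscore names such as `gemm_single_pe` fail the pattern.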
@@ -9,6 +9,7 @@ The kernel uses standard Triton patterns:
  - tl.num_programs(0) for PE count within cube
  - Shape args are automatically localized by launch()
 """
 from kernbench.benches.registry import bench
 from kernbench.policy.placement.dp import DPPolicy
 M, K = 128, 256
@@ -26,6 +27,10 @@ def _copy_kernel(src_ptr, dst_ptr, M, K, tl, DTYPE="f16"):
    tl.store(dst_ptr + offset, data)
@bench(
    name="va-offset-verify",
    description="Triton base_ptr + pid * stride VA addressing verification (TP sharded).",
 )
 def run(torch):
    """Run the VA offset verification benchmark with full TP sharding."""
    dp = DPPolicy(cube="column_wise", pe="column_wise")
@@ -1,14 +1,16 @@
 import argparse
 import sys
 from typing import cast
-from benches.loader import resolve_bench
+from kernbench.benches.registry import list_all, resolve
-from kernbench.cli.probe import cmd_probe
 from kernbench.cli.report import format_report
 from kernbench.common.types import SimEngine
 from kernbench.probes.probe import cmd_probe
 from kernbench.runtime_api.bench_runner import run_bench
 from kernbench.runtime_api.types import DeviceSelector, resolve_device
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 from kernbench.topology.types import TopologyGraph
 def build_parser() -> argparse.ArgumentParser:
@@ -17,7 +19,10 @@ def build_parser() -> argparse.ArgumentParser:
    runp = sub.add_parser("run", help="Run a benchmark")
    runp.add_argument("--topology", required=True)
-    runp.add_argument("--bench", required=True)
+    runp.add_argument(
        "--bench", required=True,
        help="Bench name (kebab-case) or numeric index from 'kernbench list'",
    )
    runp.add_argument(
        "--device", default=None, help="Target device: 'all' or 'sip:<N>' (default: all)"
    )
@@ -27,6 +32,9 @@ def build_parser() -> argparse.ArgumentParser:
    )
    runp.set_defaults(_handler=cmd_run)
    listp = sub.add_parser("list", help="List registered benches")
    listp.set_defaults(_handler=cmd_list)
    probep = sub.add_parser("probe", help="Probe latency and BW for predefined traffic patterns")
    probep.add_argument("--topology", required=True)
    probep.add_argument("--case", default="all", help="Case name or 'all' (default: all)")
@@ -43,7 +51,7 @@ def build_parser() -> argparse.ArgumentParser:
 def engine_factory(
    topology: object, device: DeviceSelector, *, enable_data: bool = False,
 ) -> SimEngine:
-    topo_obj = getattr(topology, "topology_obj", topology)
+    topo_obj = cast(TopologyGraph, getattr(topology, "topology_obj", topology))
    return GraphEngine(topo_obj, enable_data=enable_data)
@@ -53,23 +61,34 @@ def cmd_web(args) -> int:
    return 0
 def cmd_list(args) -> int:
    specs = list_all()
    print(f"{'#':>3}  {'NAME':<22}  DESCRIPTION")
    print("-" * 80)
    for s in specs:
        print(f"{s.index:>3}  {s.name:<22}  {s.description}")
    return 0
 def cmd_run(args) -> int:
    print("> Running benchmark with:", args)
    topo = resolve_topology(args.topology)
-    bench = resolve_bench(args.bench)
+    spec_entry = resolve(args.bench)
    device = resolve_device(args.device)
    verify_data = getattr(args, "verify_data", False)
    def _factory(topology, device):
        return engine_factory(topology, device, enable_data=verify_data)
-    result = run_bench(topology=topo, bench_fn=bench, device=device, engine_factory=_factory)
+    result = run_bench(
        topology=topo, bench_fn=spec_entry.run, device=device, engine_factory=_factory,
    )
    topo_obj = getattr(topo, "topology_obj", topo)
    spec = getattr(topo_obj, "spec", None)
    if result.traces:
-        print(format_report(result.traces, title=args.bench, spec=spec))
+        print(format_report(result.traces, title=spec_entry.name, spec=spec))
    print(result.summary_text())
    # Phase 2 diagnostic summary (ADR-0020). The actual Phase 2 replay
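The `--bench` argument above now accepts either a kebab-case name or the numeric index printed by `kernbench list`. The registry's `resolve` is not shown in this diff, so the following is only a minimal sketch of how such a lookup could work. `BenchSpec` and the two-argument `resolve` signature are hypothetical; only the field names (`index`, `name`, `description`, `run`) come from the attributes used in `cmd_list` and `cmd_run`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BenchSpec:
    """Hypothetical registry entry; fields mirror the attributes used in the CLI."""
    index: int
    name: str
    description: str
    run: Callable[..., object]


def resolve(token: str, specs: list[BenchSpec]) -> BenchSpec:
    """Look up a bench by numeric index or by kebab-case name (sketch only)."""
    if token.isdigit():
        # All-digit tokens are treated as the index shown by the list command.
        matches = [s for s in specs if s.index == int(token)]
    else:
        matches = [s for s in specs if s.name == token]
    if not matches:
        known = ", ".join(f"{s.index}:{s.name}" for s in specs)
        raise KeyError(f"unknown bench {token!r}; known: {known}")
    return matches[0]
```

Returning the whole entry rather than just the callable is what lets the report title use the canonical `name` even when the user typed an index.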
@@ -0,0 +1,5 @@
 """kernbench.probes: latency/BW diagnostic utilities (not benchmarks).
 See ADR-0010 D4. Probe is a developer tool for verifying the latency/BW
 model; it bypasses the bench registry.
 """
@@ -59,10 +59,23 @@ class AhbmCCLBackend:
            self._sip_topo_kind = topo_map.get(self._sip_topo, 0)
        else:
            self._sip_topo_kind = 0
        sips = spec.get("system", {}).get("sips", {})
        if self._sip_topo == "ring_1d":
            self._sip_topo_w, self._sip_topo_h = 0, 0
        elif sips.get("w") is not None and sips.get("h") is not None:
            w, h = int(sips["w"]), int(sips["h"])
            if w * h != self._n_sips:
                raise ValueError(
                    f"sip layout {w}x{h} != sips.count ({self._n_sips})"
                )
            self._sip_topo_w, self._sip_topo_h = w, h
        else:
            side = int(round(math.sqrt(self._n_sips)))
            if side * side != self._n_sips:
                raise ValueError(
                    f"SIP topology '{self._sip_topo}' requires square "
                    f"sips.count or explicit sips.w/h, got {self._n_sips}"
                )
            self._sip_topo_w, self._sip_topo_h = side, side
        # IPCQ install: wire all pe0s across all cubes and SIPs
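The three-way branch in this hunk (ring, explicit `w`/`h`, square fallback) can be read in isolation as a pure function. The sketch below restates it outside the backend class so the fail-loud behaviour is easy to exercise; the function name `resolve_sip_grid` and its argument list are illustrative, not part of `AhbmCCLBackend`.

```python
import math


def resolve_sip_grid(sip_topo: str, n_sips: int, sips: dict) -> tuple[int, int]:
    """Return (w, h) for the SIP grid, following the same three branches."""
    if sip_topo == "ring_1d":
        # A 1-D ring carries no 2-D grid dimensions.
        return 0, 0
    if sips.get("w") is not None and sips.get("h") is not None:
        # Explicit layout: must cover exactly n_sips positions.
        w, h = int(sips["w"]), int(sips["h"])
        if w * h != n_sips:
            raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
        return w, h
    # Fallback: only a perfect-square SIP count can be laid out implicitly.
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError(
            f"SIP topology '{sip_topo}' requires square "
            f"sips.count or explicit sips.w/h, got {n_sips}"
        )
    return side, side
```

With this shape, six SIPs resolve as a 3x2 grid only when `w` and `h` are given; without them the call raises instead of silently picking a layout.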
@@ -2,9 +2,13 @@ from __future__ import annotations
 import re
 from dataclasses import dataclass
 from typing import TYPE_CHECKING
 from kernbench.common.types import Completion, Trace
 if TYPE_CHECKING:
    from kernbench.sim_engine.engine import GraphEngine
@dataclass(frozen=True)
 class BenchResult:
@@ -12,7 +16,7 @@ class BenchResult:
    correlation_id: str
    trace: Trace | None = None
    traces: list[dict] | None = None
-    engine: object | None = None  # GraphEngine ref for Phase 2 data access
+    engine: GraphEngine | None = None
    def summary_text(self) -> str:
        if self.completion.ok:
@@ -46,8 +46,8 @@ def pytest_sessionfinish(session, exitstatus):
        except Exception as e:
            print(f"[conftest] aggregator {attr}() in {name} failed: {e}")
-    _exec("test_allreduce_multidevice.py", "_aggregate_sweep_plots")
+    _exec("sccl/_allreduce_helpers.py", "_aggregate_sweep_plots")
-    _exec("test_allreduce_buffer_kind_sweep.py", "aggregate_buffer_kind_plot")
+    _exec("sccl/_allreduce_helpers.py", "aggregate_buffer_kind_plot")
@pytest.fixture(scope="session")
@@ -0,0 +1,283 @@
 """Shared plotting plumbing for the GEMM figure tests.
 Not a test module (no ``test_`` prefix -> pytest does not collect it).
 Reads the committed ``docs/diagrams/gemm_sweep.json`` (produced by the heavy
 ``scripts/gemm_sweep.py`` sim sweep) and renders matplotlib PNGs into
 ``docs/diagrams/gemm_plots/``. No simulation here -> the figure tests are fast
 and run by default; regenerating the underlying data stays a manual script.
 Chart set (mirrors the GEMM MAC slides in scripts/build_overview_slides.py):
  - stage breakdown (load_ref operand staging)
  - MAC utilization — measured (load_ref)
  - MAC utilization — theoretical vs measured (load_ref)
 """
 from __future__ import annotations
 import json
 from pathlib import Path
 ROOT = Path(__file__).resolve().parent.parent.parent
 GEMM_SWEEP_JSON = ROOT / "docs" / "diagrams" / "gemm_sweep.json"
 GEMM_PLOTS_DIR = ROOT / "docs" / "diagrams" / "gemm_plots"
 # Shapes excluded from the figures (mirrors build_overview_slides).
 EXCLUDED_SHAPES = {(512, 512, 512)}
 # Stage bars shown (raw op_log stage_type keys) + display names + colors.
 STAGE_KEYS = ["DMA_READ", "FETCH", "GEMM", "DMA_WRITE"]
 STAGE_DISPLAY = {
    "DMA_READ":  "DMA in",
    "FETCH":     "Fetch",
    "GEMM":      "GEMM",
    "DMA_WRITE": "DMA out",
 }
 STAGE_COLORS = {
    "DMA_READ":  "#3B82F6",
    "FETCH":     "#10B981",
    "GEMM":      "#F59E0B",
    "DMA_WRITE": "#A855F7",
 }
 # MAC-utilization model constants (mirror build_overview_slides).
 _HBM_GBS = 256.0
 _BPE = 2
 _T_STAGE = 16.0
 _D_STAGES = 3
 _PLOT_VARIANT = "load_ref"
 def _load_sweep_data() -> dict:
    if not GEMM_SWEEP_JSON.exists():
        return {"rows": []}
    data = json.loads(GEMM_SWEEP_JSON.read_text())
    data["rows"] = [
        r for r in data.get("rows", [])
        if (r["M"], r["K"], r["N"]) not in EXCLUDED_SHAPES
    ]
    return data
 def _shape_label(r: dict) -> str:
    if r["M"] == r["K"] == r["N"]:
        return f"M=K=N={r['M']}"
    return f"M={r['M']} K={r['K']} N={r['N']}"
 def _under_tile(M, K, N, tile_M, tile_K, tile_N) -> bool:
    return M < tile_M or K < tile_K or N < tile_N
 def _xtick_labels(shape_labels, tile_counts, flagged) -> list[str]:
    out = []
    for lbl, tc, fl in zip(shape_labels, tile_counts, flagged):
        s = f"{lbl}\n({tc} tiles)"
        if fl:
            s += " *"
        out.append(s)
    return out
 def _grouped_bar_png(
    out_name: str, *, title: str, subtitle: str | None,
    shape_labels, tile_counts, flagged, series: dict, colors: dict,
    y_label: str, threshold: float | None = None, footnote: str | None = None,
 ) -> str:
    """Render one grouped-bar chart to GEMM_PLOTS_DIR/out_name; return the path."""
    import matplotlib.pyplot as plt
    import numpy as np
    n_groups = len(shape_labels)
    n_series = max(1, len(series))
    x = np.arange(n_groups)
    width = 0.8 / n_series
    fig, ax = plt.subplots(figsize=(11, 6))
    for i, (name, vals) in enumerate(series.items()):
        offset = (i - (n_series - 1) / 2) * width
        ax.bar(x + offset, vals, width, label=name, color=colors.get(name))
    ax.set_xticks(x)
    ax.set_xticklabels(
        _xtick_labels(shape_labels, tile_counts, flagged), fontsize=8,
    )
    ax.set_ylabel(y_label)
    ax.set_title(title, fontsize=13, fontweight="bold")
    if subtitle:
        ax.text(0.5, 1.01, subtitle, transform=ax.transAxes, ha="center",
                va="bottom", fontsize=8, color="#475569")
    if threshold is not None:
        ax.axhline(threshold, ls="--", color="gray", lw=1.0)
    ax.legend(fontsize=8, loc="upper right")
    ax.grid(True, axis="y", alpha=0.3)
    caption = "* = under-tile shape (M<TILE_M, K<TILE_K, or N<TILE_N)"
    if footnote:
        caption = footnote + "\n" + caption
    fig.text(0.5, 0.01, caption, ha="center", fontsize=7, color="gray",
             wrap=True)
    fig.tight_layout(rect=(0, 0.05, 1, 1))
    GEMM_PLOTS_DIR.mkdir(parents=True, exist_ok=True)
    out = GEMM_PLOTS_DIR / out_name
    fig.savefig(out, dpi=120)
    plt.close(fig)
    return str(out)
 # ── individual chart renderers (read sweep JSON, emit one PNG each) ─────
 def emit_stage_breakdown() -> str | None:
    """Per-stage engine wall-clock per shape (load_ref operand staging)."""
    data = _load_sweep_data()
    rows = [r for r in data["rows"] if r.get("variant") == _PLOT_VARIANT]
    if not rows:
        return None
    tile = data["tile_sizes"]
    shape_labels = [_shape_label(r) for r in rows]
    flagged = [_under_tile(r["M"], r["K"], r["N"], tile["M"], tile["K"], tile["N"])
               for r in rows]
    tile_counts = [r["tile_count_expected"] for r in rows]
    series = {
        STAGE_DISPLAY[s]: [r.get("stages", {}).get(s, {}).get("wall_ns", 0.0)
                           for r in rows]
        for s in STAGE_KEYS
    }
    colors = {STAGE_DISPLAY[s]: STAGE_COLORS[s] for s in STAGE_KEYS}
    return _grouped_bar_png(
        "gemm_stage_breakdown.png",
        title="GEMM stage breakdown",
        subtitle=(f"Per-stage engine wall-clock (DMA in / Fetch / GEMM / "
                  f"DMA out), {_PLOT_VARIANT} staging. "
                  f"Tile {tile['M']}x{tile['K']}x{tile['N']}."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="ns",
        footnote="Bars = engine wall-clock interval (merged overlaps).",
    )
 def emit_mac_utilization_measured() -> str | None:
    """GEMM util % and useful pipeline-eff % (analytical model, load_ref)."""
    data = _load_sweep_data()
    rows = data["rows"]
    if not rows:
        return None
    tile = data["tile_sizes"]
    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
    head_ns = (_D_STAGES - 1) * _T_STAGE
    by_shape = {(r["M"], r["K"], r["N"]): r
                for r in rows if r.get("variant") == _PLOT_VARIANT}
    shapes = list(by_shape)
    if not shapes:
        return None
    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
    gemm_util, useful_eff = [], []
    for k in shapes:
        r = by_shape[k]
        M, K, N = r["M"], r["K"], r["N"]
        useful = 2 * M * K * N
        tiles = r["tile_count_expected"]
        gu = useful / (tile_flops * tiles) * 100
        gemm_util.append(gu)
        m_tiles = (M + TILE_M - 1) // TILE_M
        n_tiles = (N + TILE_N - 1) // TILE_N
        n_mn = m_tiles * n_tiles
        compute_total = tiles * _T_STAGE
        wall = head_ns + tiles * _T_STAGE + max(0, n_mn - 1) * dma_w_per_pair
        ueff = (compute_total * (gu / 100.0) / wall) * 100 if wall > 0 else 0.0
        useful_eff.append(ueff)
    series = {"GEMM util %": gemm_util, "Useful eff %": useful_eff}
    colors = {"GEMM util %": "#10B981", "Useful eff %": "#F59E0B"}
    return _grouped_bar_png(
        "gemm_mac_utilization_measured.png",
        title="GEMM MAC utilization — load_ref",
        subtitle=("GEMM util = useful FLOPs / (tile FLOPs x tiles); "
                  "Useful eff = GEMM util x ideal pipeline efficiency."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="%", threshold=100.0,
        footnote="Theoretical ideal-pipeline model (not simulator data).",
    )
 def emit_mac_utilization_theoretical_vs_measured() -> str | None:
    """Theoretical vs simulator-measured GEMM util / useful eff (load_ref)."""
    data = _load_sweep_data()
    rows = data["rows"]
    if not rows:
        return None
    tile = data["tile_sizes"]
    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
    head_ns = (_D_STAGES - 1) * _T_STAGE
    peak_per_ns = tile_flops / _T_STAGE
    by_shape = {(r["M"], r["K"], r["N"]): r
                for r in rows if r.get("variant") == _PLOT_VARIANT}
    shapes = list(by_shape)
    if not shapes:
        return None
    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
    gu_t, gu_m, eff_t, eff_m = [], [], [], []
    for k in shapes:
        r = by_shape[k]
        M, K, N = r["M"], r["K"], r["N"]
        useful = 2 * M * K * N
        tiles = r["tile_count_expected"]
        gut = useful / (tile_flops * tiles)
        gu_t.append(gut * 100)
        rec = r.get("stages", {}).get("GEMM", {}).get("record_count", 0) or tiles
        gu_m.append((useful / (tile_flops * rec) * 100) if rec else 0.0)
        m_tiles = (M + TILE_M - 1) // TILE_M
        n_tiles = (N + TILE_N - 1) // TILE_N
        n_mn = m_tiles * n_tiles
        compute_total = tiles * _T_STAGE
        wall_t = head_ns + compute_total + max(0, n_mn - 1) * dma_w_per_pair
        eff_t.append((compute_total * gut / wall_t * 100) if wall_t > 0 else 0.0)
        cw = r.get("composite_window_ns", 0.0) or 0.0
        eff_m.append((useful / cw / peak_per_ns * 100) if cw > 0 else 0.0)
    series = {
        "GEMM util % (theoretical)": gu_t,
        "GEMM util % (measured)":    gu_m,
        "Theoretical eff %":         eff_t,
        "Measured eff %":            eff_m,
    }
    colors = {
        "GEMM util % (theoretical)": "#10B981",
        "GEMM util % (measured)":    "#6EE7B7",
        "Theoretical eff %":         "#F59E0B",
        "Measured eff %":            "#3B82F6",
    }
    return _grouped_bar_png(
        "gemm_mac_utilization_theoretical_vs_measured.png",
        title="GEMM MAC utilization — theoretical vs measured (load_ref)",
        subtitle=("Theoretical model vs simulator op_log; agreement "
                  "validates the analytical pipeline model."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="%", threshold=100.0,
    )
 def emit_all_gemm_plots() -> list[str]:
    """Render every GEMM figure that has data; return the list of paths written."""
    paths = []
    for fn in (emit_stage_breakdown,
               emit_mac_utilization_measured,
               emit_mac_utilization_theoretical_vs_measured):
        p = fn()
        if p:
            paths.append(p)
    return paths
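The ideal-pipeline arithmetic shared by the two MAC-utilization renderers can be checked by hand on a single shape. The sketch below reuses the model constants from this file, but the tile size (32x64x32) and the tile-count formula are assumptions made for illustration: the helper itself reads `tile_sizes` and `tile_count_expected` from `gemm_sweep.json`.

```python
import math

# Model constants, mirroring _HBM_GBS / _BPE / _T_STAGE / _D_STAGES.
HBM_GBS = 256.0   # GB/s, numerically equal to bytes per ns
BPE = 2           # bytes per element
T_STAGE = 16.0    # ns per pipeline stage
D_STAGES = 3      # pipeline depth


def ideal_pipeline(M, K, N, tile_m, tile_k, tile_n):
    """Return (gemm_util_pct, useful_eff_pct) under the ideal-pipeline model."""
    m_tiles = math.ceil(M / tile_m)
    k_tiles = math.ceil(K / tile_k)
    n_tiles = math.ceil(N / tile_n)
    # Assumption: stands in for the tile_count_expected field of the sweep JSON.
    tiles = m_tiles * k_tiles * n_tiles

    tile_flops = 2 * tile_m * tile_k * tile_n
    useful = 2 * M * K * N
    gemm_util = useful / (tile_flops * tiles)

    # Write-back time of one output tile, charged once per extra (m, n) pair.
    dma_w_per_pair = (tile_m * tile_n * BPE) / HBM_GBS
    head_ns = (D_STAGES - 1) * T_STAGE          # pipeline fill
    compute_total = tiles * T_STAGE
    wall = head_ns + compute_total + max(0, m_tiles * n_tiles - 1) * dma_w_per_pair

    useful_eff = compute_total * gemm_util / wall
    return gemm_util * 100, useful_eff * 100
```

For a shape that fits exactly one tile, utilization is 100% while efficiency is bounded by the two fill stages, which is why small shapes sit well below the 100% threshold line even in the theoretical series.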
@@ -0,0 +1,36 @@
 """Regenerate docs/diagrams/gemm_sweep.json by running the GEMM sweep.
 Heavy: drives matmul-composite across all shapes x variants through the
 simulator (24 runs; the 512 shape alone is 2048 tiles). Marked ``slow`` so it
 is excluded from the default ``pytest`` run (addopts: -m "not slow") and runs
 on demand:
    pytest -m slow tests/gemm/test_gemm_sweep.py
 Delegates to scripts/gemm_sweep.py (the single source of the sweep logic) via
 subprocess so there is no duplicated sim-driving code.
 """
 from __future__ import annotations
 import subprocess
 import sys
 from pathlib import Path
 import pytest
 from tests.gemm._gemm_plot_helpers import GEMM_SWEEP_JSON, ROOT
@pytest.mark.slow
 def test_gemm_sweep_regenerates_json():
    script = ROOT / "scripts" / "gemm_sweep.py"
    assert script.exists(), f"missing {script}"
    proc = subprocess.run(
        [sys.executable, str(script)],
        cwd=str(ROOT), capture_output=True, text=True,
    )
    assert proc.returncode == 0, (
        f"gemm_sweep.py failed (rc={proc.returncode})\n"
        f"stdout:\n{proc.stdout[-2000:]}\nstderr:\n{proc.stderr[-2000:]}"
    )
    assert Path(GEMM_SWEEP_JSON).exists()
@@ -0,0 +1,35 @@
 """Emit the GEMM MAC-utilization bar charts.
 Two load_ref charts: the "measured" chart, which despite its name plots the
 analytical ideal-pipeline model, and the theoretical-vs-measured overlay.
 Reads docs/diagrams/gemm_sweep.json and writes gemm_mac_utilization*.png into
 docs/diagrams/gemm_plots/.
 """
 from __future__ import annotations
 from pathlib import Path
 import pytest
 from tests.gemm._gemm_plot_helpers import (
    GEMM_SWEEP_JSON,
    emit_mac_utilization_measured,
    emit_mac_utilization_theoretical_vs_measured,
 )
@pytest.mark.skipif(
    not GEMM_SWEEP_JSON.exists(),
    reason="gemm_sweep.json absent; run scripts/gemm_sweep.py first",
 )
 def test_plot_gemm_mac_utilization_measured():
    out = emit_mac_utilization_measured()
    assert out is not None and Path(out).exists()
@pytest.mark.skipif(
    not GEMM_SWEEP_JSON.exists(),
    reason="gemm_sweep.json absent; run scripts/gemm_sweep.py first",
 )
 def test_plot_gemm_mac_utilization_theoretical_vs_measured():
    out = emit_mac_utilization_theoretical_vs_measured()
    assert out is not None and Path(out).exists()
@@ -0,0 +1,24 @@
 """Emit the GEMM per-stage engine wall-clock bar chart (load_ref).
 Reads docs/diagrams/gemm_sweep.json (run scripts/gemm_sweep.py to refresh it)
 and writes gemm_stage_breakdown.png into docs/diagrams/gemm_plots/.
 """
 from __future__ import annotations
 from pathlib import Path
 import pytest
 from tests.gemm._gemm_plot_helpers import (
    GEMM_SWEEP_JSON,
    emit_stage_breakdown,
 )
@pytest.mark.skipif(
    not GEMM_SWEEP_JSON.exists(),
    reason="gemm_sweep.json absent; run scripts/gemm_sweep.py first",
 )
 def test_plot_gemm_stage_breakdown():
    out = emit_stage_breakdown()
    assert out is not None and Path(out).exists()
@@ -1,25 +1,193 @@
-"""Config-driven multi-device allreduce test application.
+"""Shared plumbing for the sccl allreduce tests.
-Reads ``ccl.yaml`` + ``topology.yaml``, dynamically loads the kernel
+Not a test module (no ``test_`` prefix → pytest does not collect it).
-module from ``ccl.yaml → module``, and picks the inter-SIP exchange
+Holds the distributed driver, the direct-launch parity reference, the
-pattern from ``topology.yaml → system.sips.topology``.
+config writers, the sweep/buffer-kind constants, the plot aggregators
-
+(called from ``conftest.pytest_sessionfinish``), and the topology-diagram
-Run directly::
+emitter. The per-test files under ``tests/sccl/`` import from here, as do
-
+the external buffer-kind / root-center tests under ``tests/``.
    python -m pytest tests/allreduce_app.py -v -s
 """
 from __future__ import annotations
 import importlib
 import math
 import textwrap
 from pathlib import Path
 from typing import Any
 import numpy as np
 import pytest
 import yaml
 from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
 from kernbench.ccl.sfr_config import configure_sfr_intercube_multisip
 from kernbench.policy.placement.dp import DPPolicy
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent.parent / "topology.yaml"
 DEFAULT_N_ELEM = 8
 # ── config writers ────────────────────────────────────────────────────
 def _write_ccl_yaml(tmp_path) -> str:
    body = textwrap.dedent("""\
        defaults:
          algorithm: lrab_hierarchical_allreduce
          buffer_kind: tcm
          backpressure: sleep
          n_slots: 4
          slot_size: 4096
          vc_chunk_size: 256
          ipcq_credit_size_bytes: 16
        algorithms:
          lrab_hierarchical_allreduce:
            module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
            topology: none
            buffer_kind: tcm
            n_elem: 8
            root_cube: 15
    """)
    (tmp_path / "ccl.yaml").write_text(body)
    return str(tmp_path)
 def _write_temp_configs(
    tmp_path, sip_topology, n_sips, algorithm, n_elem_override=None,
    sip_w=None, sip_h=None,
 ):
    """Write temp topology.yaml and ccl.yaml with the given overrides."""
    with open(TOPOLOGY_PATH) as f:
        topo_cfg = yaml.safe_load(f)
    topo_cfg["system"]["sips"]["count"] = n_sips
    topo_cfg["system"]["sips"]["topology"] = sip_topology
    if sip_w is not None and sip_h is not None:
        topo_cfg["system"]["sips"]["w"] = int(sip_w)
        topo_cfg["system"]["sips"]["h"] = int(sip_h)
    else:
        topo_cfg["system"]["sips"].pop("w", None)
        topo_cfg["system"]["sips"].pop("h", None)
    topo_path = tmp_path / "topology.yaml"
    with open(topo_path, "w") as f:
        yaml.dump(topo_cfg, f, default_flow_style=False)
    ccl_path = Path(__file__).parent.parent.parent / "ccl.yaml"
    with open(ccl_path) as f:
        ccl_cfg = yaml.safe_load(f)
    ccl_cfg["defaults"]["algorithm"] = algorithm
    if n_elem_override is not None:
        ccl_cfg.setdefault("algorithms", {}).setdefault(
            algorithm, {},
        )["n_elem"] = int(n_elem_override)
        # Ensure IPCQ slot is big enough for the per-message payload.
        per_msg_bytes = int(n_elem_override) * 2  # f16
        default_slot = int(ccl_cfg["defaults"].get("slot_size", 4096))
        if per_msg_bytes > default_slot:
            ccl_cfg["defaults"]["slot_size"] = per_msg_bytes
    tmp_ccl = tmp_path / "ccl.yaml"
    with open(tmp_ccl, "w") as f:
        yaml.dump(ccl_cfg, f, default_flow_style=False)
    return str(topo_path), str(tmp_ccl)
 # ── distributed driver (init_process_group → mp.spawn → all_reduce) ────
 def _worker(rank: int, n_cubes: int, n_elem: int, n_sips: int, torch) -> None:
    """Per-SIP worker: allocate, fill, all_reduce, verify."""
    torch.ahbm.set_device(rank)
    dp = DPPolicy(
        cube="row_wise", pe="replicate",
        num_pes=1, num_cubes=n_cubes,
    )
    tensor = torch.zeros(
        (n_cubes, n_elem), dtype="f16", dp=dp,
        name=f"sip{rank}",
    )
    tensor.copy_(torch.from_numpy(
        np.full((n_cubes, n_elem), float(rank + 1), dtype=np.float16)
    ))
    torch.distributed.all_reduce(tensor, op="sum")
    arr = tensor.numpy()
    expected = float(n_cubes * sum(range(1, n_sips + 1)))
    for cube_id in range(n_cubes):
        assert np.allclose(arr[cube_id], expected, rtol=1e-1, atol=1e-1), (
            f"SIP{rank} cube {cube_id}: "
            f"got {arr[cube_id][:4]}, expected {expected}"
        )
    if rank == 0:
        print(f"\n  lrab_hierarchical_allreduce (ws={n_sips}): "
              f"{n_sips * n_cubes} OK")
 def _crit_ns(engine) -> float:
    """Critical-path latency = max per-result pe_exec_ns over engine results."""
    vals = [
        float(tr.get("pe_exec_ns", 0.0) or 0.0)
        for _, (_, tr) in engine._results.items()
        if isinstance(tr, dict)
    ]
    return max(vals) if vals else 0.0
 def _run_distributed(tmp_path, monkeypatch, topo_path, correlation_id, n_elem):
    """Build engine + run the collective via the full distributed path.
    Returns ``(engine, n_cubes)``. ``monkeypatch.chdir`` points the backend's
    ``load_ccl_config()`` (cwd lookup) at the temp ``ccl.yaml``.
    """
    monkeypatch.chdir(tmp_path)
    topo = resolve_topology(topo_path)
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    spec = topo.topology_obj.spec
    n_sips = int(spec["system"]["sips"]["count"])
    cm = spec["sip"]["cube_mesh"]
    n_cubes = int(cm["w"]) * int(cm["h"])
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id=correlation_id,
        spec=spec,
    ) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        assert ctx.distributed.get_world_size() == n_sips
        ctx.multiprocessing.spawn(
            _worker, args=(n_cubes, n_elem, n_sips, ctx), nprocs=n_sips,
        )
    return engine, n_cubes
 # ── correctness config matrix (used by test_allreduce) ─────────────────
 CONFIGS = [
    pytest.param(
        "lrab_hierarchical_allreduce", "ring_1d", 6, None, None,
        id="ring_6sip",
    ),
    pytest.param(
        "lrab_hierarchical_allreduce", "torus_2d", 6, 2, 3,
        id="torus_6sip_2x3",
    ),
    pytest.param(
        "lrab_hierarchical_allreduce", "mesh_2d_no_wrap", 6, 2, 3,
        id="mesh_6sip_2x3",
    ),
 ]
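The expected value asserted in `_worker` implies that the collective sums over every SIP and every cube: rank `r` fills its `(n_cubes, n_elem)` tensor with `r + 1`, and every element must end up as `n_cubes * sum(1..n_sips)`. A NumPy stand-in makes that arithmetic concrete. The function name and the cube count used in the usage note are illustrative, and nothing here touches the simulator.

```python
import numpy as np


def reference_allreduce(n_sips: int, n_cubes: int, n_elem: int) -> np.ndarray:
    """NumPy stand-in for the collective: sum over every SIP and every cube."""
    # Rank r holds an (n_cubes, n_elem) tensor filled with r + 1, as in _worker.
    inputs = np.stack([
        np.full((n_cubes, n_elem), float(rank + 1), dtype=np.float32)
        for rank in range(n_sips)
    ])
    # Reduce over the SIP axis and the cube axis; one value per element column.
    total = inputs.sum(axis=(0, 1))
    # After the allreduce every cube row on every SIP holds the same total.
    return np.broadcast_to(total, (n_cubes, n_elem)).astype(np.float16)
```

With 6 SIPs and a hypothetical 16 cubes per SIP, each element is `16 * (1 + 2 + ... + 6)`, which stays well inside the range where f16 represents integers exactly; the loose `rtol`/`atol` in the test matters more for larger configurations.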
 # ── direct-launch helper (parity reference only) ───────────────────────
 def _sip_topo_dims(
@@ -51,14 +219,14 @@ def run_allreduce(
    algorithm: str | None = None,
    ccl_yaml: str | None = None,
 ) -> dict:
-    """Config-driven allreduce: read yaml, load kernel, run.
+    """Config-driven allreduce via direct ctx.launch (no distributed wrapper).
-    Everything is resolved from config — no hardcoded kernel imports.
+    Retained as the parity reference for the distributed path and reused by
    the external buffer-kind / root-center micro-tests.
    """
    cfg_all = load_ccl_config(ccl_yaml)
    cfg = resolve_algorithm_config(cfg_all, algorithm)
    # Dynamic import from ccl.yaml → module
    algo_module = importlib.import_module(cfg["module"])
    kernel_fn = algo_module.kernel
    topo_name_to_kind = algo_module.TOPO_NAME_TO_KIND
@@ -83,15 +251,6 @@ def run_allreduce(
    )
    algo_name = cfg.get("algorithm", "allreduce")
    print(f"\n{'=' * 60}")
    print(f"algorithm:        {algo_name}")
    print(f"module:           {cfg['module']}")
    print(f"sip_topology:     {sip_topo}")
    print(f"kernel:           {kernel_fn.__name__}")
    print(f"n_sips:           {n_sips}")
    print(f"n_cubes:          {n_cubes}")
    print(f"n_elem:           {n_elem}")
    print(f"{'=' * 60}")
    configure_sfr_intercube_multisip(engine, spec, cfg)
@@ -112,11 +271,6 @@ def run_allreduce(
        ))
        tensors.append(t)
    for sip in range(n_sips):
        arr = tensors[sip].numpy()
        print(f"[SIP {sip}] input  cube0[:4] = {arr[0][:4].tolist()}  "
              f"cube{n_cubes - 1}[:4] = {arr[-1][:4].tolist()}")
    t_start = engine._env.now
    all_pending = []
@@ -129,31 +283,14 @@ def run_allreduce(
        )
        all_pending.extend(pending)
-    for h, sip_id, meta in all_pending:
+    for h, _sip_id, meta in all_pending:
        ctx.wait(h, _meta=meta)
    t_end = engine._env.now
    latency_ns = t_end - t_start
    print(f"\n[{algo_name} ws={n_sips}] sim latency = "
          f"{latency_ns:.1f} ns ({latency_ns / 1000:.3f} us)")
    for key, (_, trace) in engine._results.items():
        if not isinstance(trace, dict):
            continue
        total = trace.get("total_ns", 0.0)
        pe_exec = trace.get("pe_exec_ns", 0.0) or 0.0
        network = total - pe_exec
        print(f"  [{key}] total={total:.1f} ns  "
              f"pe_exec={pe_exec:.1f} ns  network={network:.1f} ns")
    expected = float(n_cubes * sum(range(1, n_sips + 1)))
    print()
    for sip in range(n_sips):
        arr = tensors[sip].numpy()
        print(f"[SIP {sip}] output cube0[:4]  = {arr[0][:4].tolist()}")
        print(f"[SIP {sip}] output cube{n_cubes - 1}[:4] = {arr[-1][:4].tolist()}")
    ok_cubes = 0
    for sip in range(n_sips):
        arr = tensors[sip].numpy()
@@ -166,8 +303,6 @@ def run_allreduce(
            )
            ok_cubes += 1
    print(f"\n  {algo_name} (ws={n_sips}): {ok_cubes} OK")
    return {
        "expected": expected,
        "latency_ns": latency_ns,
@@ -175,101 +310,7 @@ def run_allreduce(
    }
-# ── pytest entry point ───────────────────────────────────────────────
+# ── Latency sweep constants + aggregator ──────────────────────────────
 import pytest
 import yaml
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
 CONFIGS = [
    pytest.param(
        "intercube_allreduce", "ring_1d", 6, None, None,
        id="ring_6sip",
    ),
    pytest.param(
        "intercube_allreduce", "torus_2d", 6, 2, 3,
        id="torus_6sip_2x3",
    ),
    pytest.param(
        "intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3,
        id="mesh_6sip_2x3",
    ),
 ]
 def _write_temp_configs(
    tmp_path, sip_topology, n_sips, algorithm, n_elem_override=None,
    sip_w=None, sip_h=None,
 ):
    """Write temp topology.yaml and ccl.yaml with the given overrides."""
    with open(TOPOLOGY_PATH) as f:
        topo_cfg = yaml.safe_load(f)
    topo_cfg["system"]["sips"]["count"] = n_sips
    topo_cfg["system"]["sips"]["topology"] = sip_topology
    if sip_w is not None and sip_h is not None:
        topo_cfg["system"]["sips"]["w"] = int(sip_w)
        topo_cfg["system"]["sips"]["h"] = int(sip_h)
    else:
        topo_cfg["system"]["sips"].pop("w", None)
        topo_cfg["system"]["sips"].pop("h", None)
    topo_path = tmp_path / "topology.yaml"
    with open(topo_path, "w") as f:
        yaml.dump(topo_cfg, f, default_flow_style=False)
    ccl_path = Path(__file__).parent.parent / "ccl.yaml"
    with open(ccl_path) as f:
        ccl_cfg = yaml.safe_load(f)
    ccl_cfg["defaults"]["algorithm"] = algorithm
    if n_elem_override is not None:
        ccl_cfg.setdefault("algorithms", {}).setdefault(
            algorithm, {},
        )["n_elem"] = int(n_elem_override)
        # Ensure IPCQ slot is big enough for the per-message payload.
        per_msg_bytes = int(n_elem_override) * 2  # f16
        default_slot = int(ccl_cfg["defaults"].get("slot_size", 4096))
        if per_msg_bytes > default_slot:
            ccl_cfg["defaults"]["slot_size"] = per_msg_bytes
    tmp_ccl = tmp_path / "ccl.yaml"
    with open(tmp_ccl, "w") as f:
        yaml.dump(ccl_cfg, f, default_flow_style=False)
    return str(topo_path), str(tmp_ccl)
@pytest.mark.parametrize(
    "algorithm,sip_topology,n_sips,sip_w,sip_h", CONFIGS,
 )
 def test_allreduce(
    tmp_path, algorithm, sip_topology, n_sips, sip_w, sip_h,
 ):
    topo_path, ccl_path = _write_temp_configs(
        tmp_path, sip_topology, n_sips, algorithm,
        sip_w=sip_w, sip_h=sip_h,
    )
    topo = resolve_topology(topo_path)
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    spec = topo.topology_obj.spec
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id=f"test_{algorithm}_{sip_topology}",
        spec=spec,
    ) as ctx:
        result = run_allreduce(
            ctx, engine, spec,
            algorithm=algorithm, ccl_yaml=ccl_path,
        )
        assert result["ok_cubes"] > 0
 # ── Latency sweep (parametrized + xdist-friendly) ─────────────────────
 # avoid 16 (== n_cubes, dim_map collision). Goes up to 96 KB per PE:
 # bytes_per_pe = n_elem * 2 (f16). 49152 elem * 2 = 96 KB / PE.
@@ -280,16 +321,16 @@ _SWEEP_N_ELEM = [
 _ELEM_BYTES_F16 = 2
 _SWEEP_TOPOLOGIES = [
-    ("intercube_allreduce", "ring_1d", 6, None, None),
+    ("lrab_hierarchical_allreduce", "ring_1d", 6, None, None),
-    ("intercube_allreduce", "torus_2d", 6, 2, 3),
+    ("lrab_hierarchical_allreduce", "torus_2d", 6, 2, 3),
-    ("intercube_allreduce", "mesh_2d_no_wrap", 6, 2, 3),
+    ("lrab_hierarchical_allreduce", "mesh_2d_no_wrap", 6, 2, 3),
 ]
 # Shared on-disk staging dir for parametrized sweep rows. Each
 # parametrized invocation writes one JSON file here; the aggregator
 # (run from conftest.pytest_sessionfinish) reads them and emits the
 # combined CSV + PNG plots.
-_SWEEP_OUT_DIR = (Path(__file__).parent.parent / "docs" / "diagrams"
+_SWEEP_OUT_DIR = (Path(__file__).parent.parent.parent / "docs" / "diagrams"
                  / "allreduce_latency_plots")
 _SWEEP_ROWS_DIR = _SWEEP_OUT_DIR / "_rows"
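The staging comment above describes a write-one-row-per-case, aggregate-at-session-end pattern, which is what keeps the sweep friendly to parallel test workers: no two cases write the same file. A minimal sketch of that pattern, assuming the `<sip_topology>_<n_elem>.json` file naming and using hypothetical helper names (`write_row`, `aggregate_rows`) rather than the repository's own aggregator:

```python
import csv
import json
from pathlib import Path


def write_row(rows_dir: Path, record: dict) -> Path:
    """Persist one sweep result as its own JSON file (one file per test case)."""
    rows_dir.mkdir(parents=True, exist_ok=True)
    path = rows_dir / f"{record['sip_topology']}_{record['n_elem']}.json"
    path.write_text(json.dumps(record), encoding="utf-8")
    return path


def aggregate_rows(rows_dir: Path, out_csv: Path) -> int:
    """Merge every staged row into a single CSV; return the number of rows."""
    rows = [
        json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(rows_dir.glob("*.json"))
    ]
    if not rows:
        return 0
    # Stable ordering so the CSV does not depend on test execution order.
    rows.sort(key=lambda r: (r["sip_topology"], r["n_elem"]))
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `aggregate_rows` from a session-finish hook means it runs once after all parametrized cases, and returning a count lets the hook skip plotting when no rows were staged.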
@@ -305,69 +346,6 @@ def _sweep_params():
    return out
@pytest.mark.parametrize(
    "algorithm,sip_topology,n_sips,sip_w,sip_h,n_elem", _sweep_params(),
 )
 def test_allreduce_latency_one(
    tmp_path, algorithm, sip_topology, n_sips, sip_w, sip_h, n_elem,
 ):
    """One config of the latency sweep. xdist parallelizes across params.
    Writes a single JSON row to ``_SWEEP_ROWS_DIR``. The conftest
    sessionfinish hook aggregates rows into CSV + plots after all
    parametrized cases finish.
    """
    import json
    topo_path, ccl_path = _write_temp_configs(
        tmp_path, sip_topology, n_sips, algorithm,
        sip_w=sip_w, sip_h=sip_h,
        n_elem_override=n_elem,
    )
    topo = resolve_topology(topo_path)
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    spec = topo.topology_obj.spec
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id=f"sweep_{algorithm}_{sip_topology}_{n_elem}",
        spec=spec,
    ) as ctx:
        result = run_allreduce(
            ctx, engine, spec,
            algorithm=algorithm, ccl_yaml=ccl_path,
        )
        assert result["ok_cubes"] > 0
    pe_exec_vals = [
        float(tr.get("pe_exec_ns", 0.0) or 0.0)
        for _, (_, tr) in engine._results.items()
        if isinstance(tr, dict)
    ]
    crit_ns = max(pe_exec_vals) if pe_exec_vals else 0.0
    cm = spec["sip"]["cube_mesh"]
    n_cubes = int(cm["w"]) * int(cm["h"])
    bytes_per_sip = n_cubes * n_elem * _ELEM_BYTES_F16
    bytes_per_pe = n_elem * _ELEM_BYTES_F16
    record = {
        "algorithm": algorithm,
        "sip_topology": sip_topology,
        "n_sips": n_sips,
        "n_elem": n_elem,
        "bytes_per_pe": bytes_per_pe,
        "bytes_per_sip": bytes_per_sip,
        "latency_ns": crit_ns,
    }
    _SWEEP_ROWS_DIR.mkdir(parents=True, exist_ok=True)
    row_path = _SWEEP_ROWS_DIR / f"{sip_topology}_{n_elem}.json"
    with open(row_path, "w", encoding="utf-8") as f:
        json.dump(record, f)
 def _aggregate_sweep_plots() -> bool:
    """Read all per-config rows and emit CSV + PNG plots.
@@ -440,10 +418,22 @@ def _aggregate_sweep_plots() -> bool:
            continue
        xs = [r["bytes_per_pe"] for r in rs]
        ys = [r["latency_ns"] for r in rs]
-        title = (
+        _per_topo_titles = {
-            f"Allreduce latency — {topo_name} "
+            "ring_1d": "AllReduce_LRAB_Ring1D_6SiP(1x6)",
-            f"(n_sips={rs[0]['n_sips']})"
+            "torus_2d": "AllReduce_LRAB_2Dtorus_6SiP(2x3)",
            "mesh_2d_no_wrap": "AllReduce_LRAB_2DMesh_6SiP(2x3)",
        }
        # Descriptive output filenames (parens → underscores for
        # markdown/URL safety; topo key stays the summary.csv value).
        _per_topo_files = {
            "ring_1d": "AllReduce_LRAB_Ring1D_6SiP_1x6",
            "torus_2d": "AllReduce_LRAB_2Dtorus_6SiP_2x3",
            "mesh_2d_no_wrap": "AllReduce_LRAB_2DMesh_6SiP_2x3",
        }
        title = _per_topo_titles.get(
            topo_name, f"Allreduce latency — {topo_name}"
        )
        out_stem = _per_topo_files.get(topo_name, topo_name)
        fig, ax = plt.subplots(figsize=(8, 5))
        ax.plot(xs, ys, marker="o", color="tab:blue")
        ax.set_xscale("log", base=2)
@@ -453,75 +443,14 @@ def _aggregate_sweep_plots() -> bool:
        ax.grid(True, alpha=0.3)
        ax.xaxis.set_major_formatter(_bytes_fmt)
        fig.tight_layout()
-        fig.savefig(_SWEEP_OUT_DIR / f"{topo_name}.png", dpi=120)
+        fig.savefig(_SWEEP_OUT_DIR / f"{out_stem}.png", dpi=120)
        plt.close(fig)
-    colors = {"ring_1d": "tab:blue", "torus_2d": "tab:orange",
+    # Combined overview.png is no longer emitted — the broken-y-axis
-              "mesh_2d_no_wrap": "tab:green"}
+    # comparison (emit_comparison_fsim_plot() below →
-
+    # comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png)
-    # ── Hand-derived theoretical model for torus_2d (6 SIPs) ──
+    # supersedes it. Per-topology plots above and summary.csv are still
-    # Critical-path analysis (per packet, packet = 128 B at NoC):
+    # produced.
    #   local intra-SIP reduce + broadcast = 8 hops × 57 ns = 456 ns
    #   global X-direction reduce          = 5 UCIe + 1 UAL = 445 ns
    #   global Y-direction reduce          = 5 UCIe + 1 UAL = 445 ns
    #   per-packet startup latency         = 456 + 445 + 445 = 1346 ns
    # Packet count is PER CUBE (8 PEs/cube cooperate on the cube tile).
    # At 6144 packets/cube the pipelined total is 8741 ns, so the
    # bottleneck-stage interval τ = (8741 − 1346) / (6144 − 1) ≈ 1.204 ns.
    # T_theoretical(N) = 1346 + (N − 1) × τ
    #   where N = ceil((bytes_per_pe × 8) / 128) = ceil(bytes_per_pe / 16)
    NOC_PACKET_BYTES = 128
    PES_PER_CUBE = 8
    T_STARTUP_NS = 1346.0
    TAU_NS = (8741.0 - 1346.0) / (6144 - 1)  # ≈ 1.2038 ns/packet
    def _theoretical_torus_2d_ns(bytes_per_pe: int) -> float:
        bytes_per_cube = int(bytes_per_pe) * PES_PER_CUBE
        n_packets = max(1, -(-bytes_per_cube // NOC_PACKET_BYTES))  # ceil
        return T_STARTUP_NS + (n_packets - 1) * TAU_NS
    fig, ax = plt.subplots(figsize=(9, 6))
    for topo_name in topologies:
        rs = sorted(
            [r for r in records if r["sip_topology"] == topo_name],
            key=lambda r: r["bytes_per_pe"],
        )
        if not rs:
            continue
        ax.plot(
            [r["bytes_per_pe"] for r in rs],
            [r["latency_ns"] for r in rs],
            marker="o",
            label=f"{topo_name} (n_sips={rs[0]['n_sips']})",
            color=colors.get(topo_name),
        )
    # Theoretical torus_2d curve across all payload sizes.
    torus_rs = sorted(
        [r for r in records if r["sip_topology"] == "torus_2d"],
        key=lambda r: r["bytes_per_pe"],
    )
    if torus_rs:
        xs_th = [r["bytes_per_pe"] for r in torus_rs]
        ys_th = [_theoretical_torus_2d_ns(r["bytes_per_pe"]) for r in torus_rs]
        ax.plot(
            xs_th, ys_th,
            color="tab:red", linestyle="--", linewidth=1.6, marker="x",
            label="theoretical torus_2d (6 SIPs)",
        )
    ax.set_xscale("log", base=2)
    ax.set_xlabel("Bytes per PE (log scale)")
    ax.set_ylabel("Time (ns)")
    ax.set_title("Multi-device allreduce latency by topology")
    ax.grid(True, alpha=0.3)
    ax.set_xlim(left=min(r["bytes_per_pe"] for r in records) / 2,
                right=max(r["bytes_per_pe"] for r in records) * 1.5)
    ax.legend()
    ax.xaxis.set_major_formatter(_bytes_fmt)
    fig.tight_layout()
    fig.savefig(_SWEEP_OUT_DIR / "overview.png", dpi=120)
    plt.close(fig)
    # Cleanup row staging dir so a partial future run doesn't pick up
    # stale rows.
@@ -535,7 +464,119 @@ def _aggregate_sweep_plots() -> bool:
    except OSError:
        pass
-    print(f"\nWrote {_SWEEP_OUT_DIR / 'overview.png'} "
+    print(f"\nWrote per-topology plots + summary.csv to {_SWEEP_OUT_DIR} "
          f"from {len(records)} rows")
    return True
 # ── Buffer-kind sweep constants + aggregator ──────────────────────────
 #
 # Parametrized over (buffer_kind, n_elem) on torus_2d 6 SIPs (3×2). Until
 # slot-access latency is modeled, the three lines overlap exactly (slot
 # access is latency-free today); they will spread out once tcm/sram/hbm
 # carry distinct access costs.
 _BUFFER_KINDS = ["tcm", "sram", "hbm"]
 _BK_N_ELEM_GRID = [128, 1024, 8192, 32768]   # 256 B → 64 KB per slot
 _BK_ROWS_DIR = _SWEEP_OUT_DIR / "_buffer_kind_rows"
 # Descriptive output stem (shared by the .png and .csv).
 _BK_OUT_STEM = "AllReduce_LRAB_2Dtorus_6SiP_2x3_with_TCM_SRAM_HBM"
 def _bk_params():
    out = []
    for bk in _BUFFER_KINDS:
        for n_elem in _BK_N_ELEM_GRID:
            out.append(pytest.param(bk, n_elem, id=f"{bk}-n_elem{n_elem}"))
    return out
 def aggregate_buffer_kind_plot() -> bool:
    """Read per-config rows and emit the descriptive .png + .csv (_BK_OUT_STEM).
    Called from conftest.pytest_sessionfinish (controller-only).
    Returns True if rows were aggregated.
    """
    import csv
    import json
    if not _BK_ROWS_DIR.exists():
        return False
    row_files = sorted(_BK_ROWS_DIR.glob("*.json"))
    if not row_files:
        return False
    records = []
    for p in row_files:
        with open(p, encoding="utf-8") as f:
            records.append(json.load(f))
    import matplotlib.pyplot as plt
    from matplotlib.ticker import FuncFormatter
    def _fmt_bytes(x, _pos):
        if x <= 0:
            return "0"
        if x >= 1024 * 1024:
            return f"{x / (1024 * 1024):.0f} MB"
        if x >= 1024:
            return f"{x / 1024:.0f} KB"
        return f"{x:.0f} B"
    _bytes_fmt = FuncFormatter(_fmt_bytes)
    _SWEEP_OUT_DIR.mkdir(parents=True, exist_ok=True)
    with open(_SWEEP_OUT_DIR / f"{_BK_OUT_STEM}.csv", "w",
              newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=[
            "buffer_kind", "sip_topology", "n_sips", "n_elem",
            "bytes_per_pe", "latency_ns",
        ])
        w.writeheader()
        for r in sorted(records, key=lambda r: (
            r["buffer_kind"], r["bytes_per_pe"],
        )):
            w.writerow(r)
    colors = {"tcm": "tab:blue", "sram": "tab:orange", "hbm": "tab:red"}
    fig, ax = plt.subplots(figsize=(10, 6))
    for bk in ["tcm", "sram", "hbm"]:
        rs = sorted(
            [r for r in records if r["buffer_kind"] == bk],
            key=lambda r: r["bytes_per_pe"],
        )
        if not rs:
            continue
        ax.plot(
            [r["bytes_per_pe"] for r in rs],
            [r["latency_ns"] for r in rs],
            marker="o", lw=2.0,
            color=colors[bk], label=f"buffer_kind = {bk}",
        )
    ax.set_xscale("log", base=2)
    ax.set_xlabel("Bytes per PE (log scale)")
    ax.set_ylabel("Time (ns)")
    ax.set_title(
        "AllReduce_LRAB_2Dtorus_6SiP(2x3) — IPCQ memory (SRAM, TCM, HBM)"
    )
    ax.grid(True, alpha=0.3)
    ax.legend()
    ax.xaxis.set_major_formatter(_bytes_fmt)
    fig.tight_layout()
    fig.savefig(_SWEEP_OUT_DIR / f"{_BK_OUT_STEM}.png", dpi=130)
    plt.close(fig)
    for p in row_files:
        try:
            p.unlink()
        except OSError:
            pass
    try:
        _BK_ROWS_DIR.rmdir()
    except OSError:
        pass
    print(f"\nWrote {_SWEEP_OUT_DIR / f'{_BK_OUT_STEM}.png'} "
          f"from {len(records)} rows")
    return True
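The stage-then-aggregate pattern used by the two aggregators above can be sketched in isolation. This is a minimal illustration, not the harness code: `write_row`, `aggregate_rows`, and `demo` are hypothetical names, the CSV columns are a reduced subset, and the latency values are made-up placeholders. Each parametrized case writes its own JSON file, so parallel workers never contend for a shared file, and a single controller-side step merges the rows and clears the staging directory.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_row(rows_dir: Path, record: dict) -> Path:
    """Stage one sweep result as its own JSON file (one file per case)."""
    rows_dir.mkdir(parents=True, exist_ok=True)
    path = rows_dir / f"{record['buffer_kind']}_{record['n_elem']}.json"
    path.write_text(json.dumps(record), encoding="utf-8")
    return path


def aggregate_rows(rows_dir: Path, csv_path: Path) -> int:
    """Merge staged rows into one CSV, then remove the staging files.

    Returns the number of rows written (0 if nothing was staged).
    """
    row_files = sorted(rows_dir.glob("*.json")) if rows_dir.exists() else []
    if not row_files:
        return 0
    records = [json.loads(p.read_text(encoding="utf-8")) for p in row_files]
    records.sort(key=lambda r: (r["buffer_kind"], r["bytes_per_pe"]))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["buffer_kind", "n_elem", "bytes_per_pe", "latency_ns"],
        )
        writer.writeheader()
        writer.writerows(records)
    # Clean up so a later partial run does not pick up stale rows.
    for p in row_files:
        p.unlink()
    rows_dir.rmdir()
    return len(records)


def demo() -> list[dict]:
    """Stage two placeholder rows in a temp dir and return the parsed CSV."""
    with tempfile.TemporaryDirectory() as tmp:
        rows_dir = Path(tmp) / "_rows"
        csv_path = Path(tmp) / "summary.csv"
        for kind, n_elem, latency in [("tcm", 1024, 10.0), ("hbm", 128, 5.0)]:
            write_row(rows_dir, {
                "buffer_kind": kind, "n_elem": n_elem,
                "bytes_per_pe": n_elem * 2, "latency_ns": latency,
            })
        aggregate_rows(rows_dir, csv_path)
        with open(csv_path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
```

Sorting before writing keeps the CSV stable regardless of the order in which workers finished.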
@@ -830,7 +871,143 @@ def emit_topology_diagram() -> str:
    return str(out_path)
-def test_emit_topology_diagram():
+# ── Comparison vs FSIM (broken-y-axis) ────────────────────────────────
-    """Emit topology.png alongside the sweep plots. Pure plotting; no sim."""
+#
-    out = emit_topology_diagram()
+# Post-processes summary.csv: today's three model curves + a hand-derived
-    assert Path(out).exists()
+# theoretical torus_2d line in the bottom panel, and a single external FSIM
 # single-device reference marker in the top panel (hardcoded 366 µs; no
 # external data file). Reads summary.csv written by _aggregate_sweep_plots.
 _FSIM_EXT_LABEL = "FSIM (single device): 366 µs"
 _FSIM_EXT_LATENCY_NS = 366_000.0
 _CMP_COLORS = {
    "ring_1d": "tab:blue",
    "torus_2d": "tab:orange",
    "mesh_2d_no_wrap": "tab:green",
 }
 _CMP_DISPLAY = {
    "ring_1d": "Ring 1x6 (6 devices)",
    "torus_2d": "2D Torus 2x3 (6 devices)",
    "mesh_2d_no_wrap": "2D Mesh 2x3 (6 devices)",
 }
 # Hand-derived theoretical model for torus_2d (6 SIPs). The NoC-packet
 # count is per cube (8 PEs/cube cooperate on the cube tile):
 #   T(N) = T_startup + (N - 1) * tau,  N = ceil(bytes_per_pe * 8 / 128)
 _CMP_NOC_PACKET_BYTES = 128
 _CMP_PES_PER_CUBE = 8
 _CMP_T_STARTUP_NS = 1346.0
 _CMP_TAU_NS = (8741.0 - 1346.0) / (6144 - 1)
 def emit_comparison_fsim_plot() -> str | None:
    """Render comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png.
    Reads ``summary.csv`` (written by ``_aggregate_sweep_plots``). Returns the
    output path, or ``None`` if summary.csv is absent / empty.
    """
    import csv
    csv_path = _SWEEP_OUT_DIR / "summary.csv"
    if not csv_path.exists():
        return None
    records = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            records.append({
                "sip_topology": row["sip_topology"],
                "bytes_per_pe": int(row["bytes_per_pe"]),
                "latency_ns": float(row["latency_ns"]),
            })
    if not records:
        return None
    import matplotlib.pyplot as plt
    import matplotlib.ticker as mticker
    def _theoretical_torus_2d_ns(bytes_per_pe: int) -> float:
        bytes_per_cube = int(bytes_per_pe) * _CMP_PES_PER_CUBE
        n_packets = max(1, -(-bytes_per_cube // _CMP_NOC_PACKET_BYTES))
        return _CMP_T_STARTUP_NS + (n_packets - 1) * _CMP_TAU_NS
    def _bytes_fmt(x, _pos):
        if x >= 1024 * 1024:
            return f"{x / (1024 * 1024):.0f}M"
        if x >= 1024:
            return f"{x / 1024:.0f}K"
        return f"{int(x)}"
    topologies = sorted({r["sip_topology"] for r in records})
    max_local = max(r["latency_ns"] for r in records)
    ext_x = max(r["bytes_per_pe"] for r in records)
    fig, (ax_top, ax_bot) = plt.subplots(
        2, 1, sharex=True,
        gridspec_kw={"height_ratios": [1, 4], "hspace": 0.05},
        figsize=(9, 6.5),
    )
    # Bottom panel: model curves + theoretical torus, linear y.
    for topo in topologies:
        rs = sorted([r for r in records if r["sip_topology"] == topo],
                    key=lambda r: r["bytes_per_pe"])
        if not rs:
            continue
        ax_bot.plot(
            [r["bytes_per_pe"] for r in rs],
            [r["latency_ns"] for r in rs],
            marker="o", label=_CMP_DISPLAY.get(topo, topo),
            color=_CMP_COLORS.get(topo),
        )
    torus_rs = sorted(
        [r for r in records if r["sip_topology"] == "torus_2d"],
        key=lambda r: r["bytes_per_pe"],
    )
    if torus_rs:
        ax_bot.plot(
            [r["bytes_per_pe"] for r in torus_rs],
            [_theoretical_torus_2d_ns(r["bytes_per_pe"]) for r in torus_rs],
            color="tab:red", linestyle="--", linewidth=1.6, marker="x",
            label="Theoretical 2D Torus 2x3",
        )
    ax_bot.set_ylim(0, max_local * 1.10)
    # Top panel: external FSIM single-device reference marker.
    ax_top.scatter(
        [ext_x], [_FSIM_EXT_LATENCY_NS],
        marker="*", s=240, color="tab:red", zorder=5,
        label=_FSIM_EXT_LABEL,
    )
    ax_top.set_ylim(_FSIM_EXT_LATENCY_NS * 0.93, _FSIM_EXT_LATENCY_NS * 1.05)
    # Hide spine between panels; draw diagonal break ticks.
    ax_top.spines["bottom"].set_visible(False)
    ax_bot.spines["top"].set_visible(False)
    ax_top.tick_params(labeltop=False, bottom=False)
    ax_bot.xaxis.tick_bottom()
    d = 0.012
    kw = dict(transform=ax_top.transAxes, color="k", clip_on=False, lw=1)
    ax_top.plot((-d, +d), (-d, +d), **kw)
    ax_top.plot((1 - d, 1 + d), (-d, +d), **kw)
    kw.update(transform=ax_bot.transAxes)
    ax_bot.plot((-d, +d), (1 - d * 4, 1 + d * 4), **kw)
    ax_bot.plot((1 - d, 1 + d), (1 - d * 4, 1 + d * 4), **kw)
    ax_bot.set_xscale("log", base=2)
    ax_bot.set_xlabel("Bytes per PE (log scale)")
    ax_bot.set_ylabel("Time (ns)")
    ax_top.set_ylabel("Time (ns)")
    ax_bot.grid(True, alpha=0.3)
    ax_top.grid(True, alpha=0.3)
    ax_bot.xaxis.set_major_formatter(mticker.FuncFormatter(_bytes_fmt))
    handles_bot, labels_bot = ax_bot.get_legend_handles_labels()
    handles_top, labels_top = ax_top.get_legend_handles_labels()
    ax_bot.legend(handles_bot + handles_top, labels_bot + labels_top,
                  loc="upper left")
    fig.suptitle("Multidevice allreduce (ring, Mesh, 2DTorus) vs FSIM latency")
    fig.tight_layout()
    out = (_SWEEP_OUT_DIR
           / "comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png")
    fig.savefig(out, dpi=120)
    plt.close(fig)
    return str(out)
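As a sanity check on the theoretical torus_2d line, the model can be evaluated standalone. The sketch below copies the constants from the `_CMP_*` definitions above; the unprefixed names are illustrative only. It shows that the model reproduces its own calibration point: 96 KB per PE gives 6144 packets per cube and therefore 8741 ns.

```python
NOC_PACKET_BYTES = 128                      # NoC packet size
PES_PER_CUBE = 8                            # PEs cooperating on one cube tile
T_STARTUP_NS = 1346.0                       # 456 + 445 + 445 ns
TAU_NS = (8741.0 - 1346.0) / (6144 - 1)     # bottleneck-stage interval


def packets_per_cube(bytes_per_pe: int) -> int:
    """Ceil-divide the cube payload into NoC packets (at least one)."""
    bytes_per_cube = int(bytes_per_pe) * PES_PER_CUBE
    return max(1, -(-bytes_per_cube // NOC_PACKET_BYTES))


def theoretical_torus_2d_ns(bytes_per_pe: int) -> float:
    """Pipelined latency: startup plus one tau per additional packet."""
    return T_STARTUP_NS + (packets_per_cube(bytes_per_pe) - 1) * TAU_NS
```

A payload of 16 bytes per PE or less fits in a single packet, so the model returns the bare startup latency there.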
@@ -0,0 +1,35 @@
 """Correctness of intercube allreduce across SIP topologies (distributed path).
 Routes through init_process_group → mp.spawn → dist.all_reduce for ring_1d,
 torus_2d (2×3), and mesh_2d_no_wrap (2×3). Per-rank correctness is asserted
 inside the worker; spawn raises on failure.
 """
 from __future__ import annotations
 import pytest
 from tests.sccl._allreduce_helpers import (
    CONFIGS,
    DEFAULT_N_ELEM,
    _crit_ns,
    _run_distributed,
    _write_temp_configs,
 )
@pytest.mark.parametrize(
    "algorithm,sip_topology,n_sips,sip_w,sip_h", CONFIGS,
 )
 def test_allreduce(
    tmp_path, monkeypatch, algorithm, sip_topology, n_sips, sip_w, sip_h,
 ):
    topo_path, _ = _write_temp_configs(
        tmp_path, sip_topology, n_sips, algorithm,
        sip_w=sip_w, sip_h=sip_h,
    )
    engine, _n_cubes = _run_distributed(
        tmp_path, monkeypatch, topo_path,
        f"test_{algorithm}_{sip_topology}", DEFAULT_N_ELEM,
    )
    # A positive critical path confirms the kernel actually ran.
    assert _crit_ns(engine) > 0.0
@@ -0,0 +1,47 @@
 """Full distributed path against topology.yaml as-is (no overrides).
 The same flow a real DDP training script would use:
 init_process_group(backend="ahbm") → mp.spawn → dist.all_reduce.
 """
 from __future__ import annotations
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 from tests.sccl._allreduce_helpers import (
    DEFAULT_N_ELEM,
    TOPOLOGY_PATH,
    _worker,
    _write_ccl_yaml,
 )
 def test_distributed_lrab_hierarchical_allreduce(tmp_path, monkeypatch):
    monkeypatch.chdir(_write_ccl_yaml(tmp_path))
    topo = resolve_topology(str(TOPOLOGY_PATH))
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    spec = topo.topology_obj.spec
    n_sips = int(spec["system"]["sips"]["count"])
    cm = spec["sip"]["cube_mesh"]
    n_cubes = int(cm["w"]) * int(cm["h"])
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id="dist_intercube_ar",
        spec=spec,
    ) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        assert ctx.distributed.get_world_size() == n_sips
        t_start = engine._env.now
        ctx.multiprocessing.spawn(
            _worker, args=(n_cubes, DEFAULT_N_ELEM, n_sips, ctx),
            nprocs=n_sips,
        )
        t_end = engine._env.now
        print(f"\n[distributed] sim latency = "
              f"{t_end - t_start:.1f} ns ({(t_end - t_start) / 1000:.3f} us)")
@@ -1,7 +1,7 @@
-"""Phase 1 test for moving the intercube_allreduce root cube from the
+"""Phase 1 test for moving the lrab_hierarchical_allreduce root cube from the
 bottom-right corner (3,3) to the geometric center (2,2).
-Today's algorithm (intercube_allreduce.py) hardcodes
+Today's algorithm (lrab_hierarchical_allreduce.py) hardcodes
 ``root_cube = (cube_h-1) * cube_w + (cube_w-1)`` (= cube 15 in 4×4).
 The intra-SIP critical path for one allreduce is therefore::
@@ -40,7 +40,7 @@ from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
-from tests.test_allreduce_multidevice import (
+from tests.sccl._allreduce_helpers import (
    _write_temp_configs,
    run_allreduce,
 )
@@ -55,7 +55,7 @@ def _run_torus_96kb(tmp_path: Path) -> float:
        sub,
        sip_topology="torus_2d",
        n_sips=6,
-        algorithm="intercube_allreduce",
+        algorithm="lrab_hierarchical_allreduce",
        sip_w=3, sip_h=2,
        n_elem_override=49152,   # 49152 × 2 = 96 KB / slot
    )
@@ -70,7 +70,7 @@ def _run_torus_96kb(tmp_path: Path) -> float:
    ) as ctx:
        result = run_allreduce(
            ctx, engine, spec,
-            algorithm="intercube_allreduce", ccl_yaml=ccl_path,
+            algorithm="lrab_hierarchical_allreduce", ccl_yaml=ccl_path,
        )
        assert result["ok_cubes"] > 0
    pe_exec_vals = [
@@ -121,7 +121,7 @@ def test_correctness_preserved(tmp_path):
        sub,
        sip_topology="torus_2d",
        n_sips=6,
-        algorithm="intercube_allreduce",
+        algorithm="lrab_hierarchical_allreduce",
        sip_w=3, sip_h=2,
        n_elem_override=128,   # tiny payload to keep this fast
    )
@@ -136,7 +136,7 @@ def test_correctness_preserved(tmp_path):
    ) as ctx:
        result = run_allreduce(
            ctx, engine, spec,
-            algorithm="intercube_allreduce", ccl_yaml=ccl_path,
+            algorithm="lrab_hierarchical_allreduce", ccl_yaml=ccl_path,
        )
    n_cubes = 6 * 16  # 6 SIPs × 16 cubes/SIP
    assert result["ok_cubes"] == n_cubes, (
@@ -0,0 +1,66 @@
 """Buffer-kind sweep (TCM / SRAM / HBM) on torus_2d 6 SIPs (3×2), distributed.
 Each parametrized case writes one JSON row; the conftest sessionfinish hook
 calls ``aggregate_buffer_kind_plot`` to emit the comparison PNG + CSV. Until
 slot-access latency is modeled, the three lines overlap exactly (slot access
 is latency-free today).
 """
 from __future__ import annotations
 import json
 import pytest
 import yaml
 from tests.sccl._allreduce_helpers import (
    _BK_ROWS_DIR,
    _ELEM_BYTES_F16,
    _bk_params,
    _crit_ns,
    _run_distributed,
    _write_temp_configs,
 )
@pytest.mark.parametrize("buffer_kind,n_elem", _bk_params())
 def test_buffer_kind_allreduce_one(tmp_path, monkeypatch, buffer_kind, n_elem):
    sub = tmp_path / f"{buffer_kind}_{n_elem}"
    sub.mkdir()
    topo_path, ccl_path = _write_temp_configs(
        sub,
        sip_topology="torus_2d",
        n_sips=6,
        algorithm="lrab_hierarchical_allreduce",
        sip_w=3, sip_h=2,
        n_elem_override=n_elem,
    )
    # Override buffer_kind in the temp ccl.yaml (read by the ahbm backend
    # at init_process_group time via load_ccl_config()).
    with open(ccl_path) as f:
        ccl_cfg = yaml.safe_load(f)
    ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
    ccl_cfg.setdefault("algorithms", {}).setdefault(
        "lrab_hierarchical_allreduce", {},
    )["buffer_kind"] = buffer_kind
    with open(ccl_path, "w") as f:
        yaml.dump(ccl_cfg, f, default_flow_style=False)
    engine, _ = _run_distributed(
        sub, monkeypatch, topo_path,
        f"bk_sweep_{buffer_kind}_{n_elem}", n_elem,
    )
    crit_ns = _crit_ns(engine)
    bytes_per_pe = n_elem * _ELEM_BYTES_F16
    record = {
        "buffer_kind": buffer_kind,
        "sip_topology": "torus_2d",
        "n_sips": 6,
        "n_elem": n_elem,
        "bytes_per_pe": bytes_per_pe,
        "latency_ns": crit_ns,
    }
    _BK_ROWS_DIR.mkdir(parents=True, exist_ok=True)
    row_path = _BK_ROWS_DIR / f"{buffer_kind}_{n_elem}.json"
    with open(row_path, "w", encoding="utf-8") as f:
        json.dump(record, f)
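The buffer_kind override in the test above relies on chained `setdefault` calls so that it works whether or not the `defaults` and `algorithms` sections already exist in the loaded config. A minimal sketch on a plain dict (no YAML I/O), where `override_buffer_kind` is a hypothetical helper name and the extra `n_elem` key is only illustrative:

```python
def override_buffer_kind(ccl_cfg: dict, algorithm: str, buffer_kind: str) -> dict:
    """Set buffer_kind both globally and for one algorithm, in place.

    setdefault creates a missing section but leaves an existing one
    (and its other keys) untouched.
    """
    ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
    algo_cfg = ccl_cfg.setdefault("algorithms", {}).setdefault(algorithm, {})
    algo_cfg["buffer_kind"] = buffer_kind
    return ccl_cfg


# Existing keys survive; only buffer_kind is replaced.
cfg = {"algorithms": {"lrab_hierarchical_allreduce": {"n_elem": 128,
                                                      "buffer_kind": "tcm"}}}
override_buffer_kind(cfg, "lrab_hierarchical_allreduce", "hbm")
```

Writing the value in both places avoids depending on which of the two the backend consults first.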
@@ -0,0 +1,23 @@
 """Emit the broken-y-axis allreduce-vs-FSIM comparison plot.
 Post-processes summary.csv (written by the latency sweep) into
 comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png. Pure
 plotting; reads the on-disk summary.csv (skips if the sweep has never run).
 """
 from __future__ import annotations
 from pathlib import Path
 import pytest
 from tests.sccl._allreduce_helpers import (
    _SWEEP_OUT_DIR,
    emit_comparison_fsim_plot,
 )
 def test_emit_comparison_fsim_plot():
    if not (_SWEEP_OUT_DIR / "summary.csv").exists():
        pytest.skip("summary.csv absent; run the latency sweep first")
    out = emit_comparison_fsim_plot()
    assert out is not None and Path(out).exists()
@@ -0,0 +1,58 @@
 """Allreduce latency sweep (distributed path), xdist-friendly.
 Each parametrized case writes one JSON row to the shared staging dir; the
 conftest sessionfinish hook calls ``_aggregate_sweep_plots`` to emit the
 per-topology PNGs + summary.csv after all cases finish.
 """
 from __future__ import annotations
 import json
 import pytest
 from tests.sccl._allreduce_helpers import (
    _ELEM_BYTES_F16,
    _SWEEP_ROWS_DIR,
    _crit_ns,
    _run_distributed,
    _sweep_params,
    _write_temp_configs,
 )
@pytest.mark.parametrize(
    "algorithm,sip_topology,n_sips,sip_w,sip_h,n_elem", _sweep_params(),
 )
 def test_allreduce_latency_one(
    tmp_path, monkeypatch, algorithm, sip_topology, n_sips, sip_w, sip_h,
    n_elem,
 ):
    topo_path, _ = _write_temp_configs(
        tmp_path, sip_topology, n_sips, algorithm,
        sip_w=sip_w, sip_h=sip_h,
        n_elem_override=n_elem,
    )
    engine, n_cubes = _run_distributed(
        tmp_path, monkeypatch, topo_path,
        f"sweep_{algorithm}_{sip_topology}_{n_elem}", n_elem,
    )
    crit_ns = _crit_ns(engine)
    bytes_per_sip = n_cubes * n_elem * _ELEM_BYTES_F16
    bytes_per_pe = n_elem * _ELEM_BYTES_F16
    record = {
        "algorithm": algorithm,
        "sip_topology": sip_topology,
        "n_sips": n_sips,
        "n_elem": n_elem,
        "bytes_per_pe": bytes_per_pe,
        "bytes_per_sip": bytes_per_sip,
        "latency_ns": crit_ns,
    }
    _SWEEP_ROWS_DIR.mkdir(parents=True, exist_ok=True)
    row_path = _SWEEP_ROWS_DIR / f"{sip_topology}_{n_elem}.json"
    with open(row_path, "w", encoding="utf-8") as f:
        json.dump(record, f)
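The two payload columns in each sweep row follow directly from the element count. A short worked example, assuming f16 elements (2 bytes, as in `_ELEM_BYTES_F16`) and 16 cubes per SIP (the 4×4 cube mesh mentioned in the root-cube test); `payload_bytes` is an illustrative name:

```python
ELEM_BYTES_F16 = 2


def payload_bytes(n_elem: int, n_cubes: int) -> tuple[int, int]:
    """Return (bytes_per_pe, bytes_per_sip) using the sweep's formulas."""
    bytes_per_pe = n_elem * ELEM_BYTES_F16
    bytes_per_sip = n_cubes * n_elem * ELEM_BYTES_F16
    return bytes_per_pe, bytes_per_sip


# Buffer-kind grid: 128 elements -> 256 B per PE, 32768 -> 64 KB per PE.
grid = [payload_bytes(n, n_cubes=16) for n in (128, 1024, 8192, 32768)]
```

The plots use `bytes_per_pe` on the x-axis; `bytes_per_sip` is recorded in the row only.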
@@ -0,0 +1,11 @@
 """Emit topology.png (device-level + cube-level reduction). Pure plotting; no sim."""
 from __future__ import annotations
 from pathlib import Path
 from tests.sccl._allreduce_helpers import emit_topology_diagram
 def test_emit_topology_diagram():
    out = emit_topology_diagram()
    assert Path(out).exists()
@@ -1,196 +0,0 @@
 """Phase 1 buffer-kind allreduce sweep — torus_2d 6 SIPs.
 Parametrized over (buffer_kind, n_elem). Each case runs the standard
 config-driven allreduce app and writes a JSON row to a shared staging
 dir; the conftest sessionfinish hook (added in Phase 1) aggregates
 rows into ``docs/diagrams/allreduce_latency_plots/buffer_kind_sweep.png``.
 Pre-Phase-2: the three buffer-kind lines overlap exactly because slot
 access is latency-free today. Post-Phase-2 they spread out (tcm
 fastest, hbm slowest).
 """
 from __future__ import annotations
 import json
 from pathlib import Path
 import pytest
 import yaml
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 # Reuse the allreduce app helpers.
 from tests.test_allreduce_multidevice import (
    _write_temp_configs,
    run_allreduce,
 )
 _BUFFER_KINDS = ["tcm", "sram", "hbm"]
 _N_ELEM_GRID = [128, 1024, 8192, 32768]   # 256 B → 64 KB per slot
 _ELEM_BYTES_F16 = 2
 _OUT_DIR = (Path(__file__).parent.parent / "docs" / "diagrams"
            / "allreduce_latency_plots")
 _ROWS_DIR = _OUT_DIR / "_buffer_kind_rows"
 def _bk_params():
    out = []
    for bk in _BUFFER_KINDS:
        for n_elem in _N_ELEM_GRID:
            out.append(pytest.param(bk, n_elem, id=f"{bk}-n_elem{n_elem}"))
    return out
@pytest.mark.parametrize("buffer_kind,n_elem", _bk_params())
 def test_buffer_kind_allreduce_one(tmp_path, buffer_kind, n_elem):
    """One config of the buffer-kind sweep. xdist parallelizes."""
    sub = tmp_path / f"{buffer_kind}_{n_elem}"
    sub.mkdir()
    topo_path, ccl_path = _write_temp_configs(
        sub,
        sip_topology="torus_2d",
        n_sips=6,
        algorithm="intercube_allreduce",
        sip_w=3, sip_h=2,
        n_elem_override=n_elem,
    )
    # Override buffer_kind in the temp ccl.yaml.
    with open(ccl_path) as f:
        ccl_cfg = yaml.safe_load(f)
    ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
    ccl_cfg.setdefault("algorithms", {}).setdefault(
        "intercube_allreduce", {},
    )["buffer_kind"] = buffer_kind
    with open(ccl_path, "w") as f:
        yaml.dump(ccl_cfg, f, default_flow_style=False)
    topo = resolve_topology(topo_path)
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    spec = topo.topology_obj.spec
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id=f"bk_sweep_{buffer_kind}_{n_elem}",
        spec=spec,
    ) as ctx:
        result = run_allreduce(
            ctx, engine, spec,
            algorithm="intercube_allreduce", ccl_yaml=ccl_path,
        )
        assert result["ok_cubes"] > 0
    pe_exec_vals = [
        float(tr.get("pe_exec_ns", 0.0) or 0.0)
        for _, (_, tr) in engine._results.items()
        if isinstance(tr, dict)
    ]
    crit_ns = max(pe_exec_vals) if pe_exec_vals else 0.0
    bytes_per_pe = n_elem * _ELEM_BYTES_F16
    record = {
        "buffer_kind": buffer_kind,
        "sip_topology": "torus_2d",
        "n_sips": 6,
        "n_elem": n_elem,
        "bytes_per_pe": bytes_per_pe,
        "latency_ns": crit_ns,
    }
    _ROWS_DIR.mkdir(parents=True, exist_ok=True)
    row_path = _ROWS_DIR / f"{buffer_kind}_{n_elem}.json"
    with open(row_path, "w", encoding="utf-8") as f:
        json.dump(record, f)
 def aggregate_buffer_kind_plot() -> bool:
    """Read per-config rows and emit buffer_kind_sweep.png + CSV.
    Called from conftest.pytest_sessionfinish (controller-only).
    Returns True if rows were aggregated.
    """
    import csv
    if not _ROWS_DIR.exists():
        return False
    row_files = sorted(_ROWS_DIR.glob("*.json"))
    if not row_files:
        return False
    records = []
    for p in row_files:
        with open(p, encoding="utf-8") as f:
            records.append(json.load(f))
    import matplotlib.pyplot as plt
    from matplotlib.ticker import FuncFormatter
    def _fmt_bytes(x, _pos):
        if x <= 0:
            return "0"
        if x >= 1024 * 1024:
            return f"{x / (1024 * 1024):.0f} MB"
        if x >= 1024:
            return f"{x / 1024:.0f} KB"
        return f"{x:.0f} B"
    _bytes_fmt = FuncFormatter(_fmt_bytes)
    _OUT_DIR.mkdir(parents=True, exist_ok=True)
    with open(_OUT_DIR / "buffer_kind_sweep.csv", "w",
              newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=[
            "buffer_kind", "sip_topology", "n_sips", "n_elem",
            "bytes_per_pe", "latency_ns",
        ])
        w.writeheader()
        for r in sorted(records, key=lambda r: (
            r["buffer_kind"], r["bytes_per_pe"],
        )):
            w.writerow(r)
    colors = {"tcm": "tab:blue", "sram": "tab:orange", "hbm": "tab:red"}
    fig, ax = plt.subplots(figsize=(10, 6))
    for bk in ["tcm", "sram", "hbm"]:
        rs = sorted(
            [r for r in records if r["buffer_kind"] == bk],
            key=lambda r: r["bytes_per_pe"],
        )
        if not rs:
            continue
        ax.plot(
            [r["bytes_per_pe"] for r in rs],
            [r["latency_ns"] for r in rs],
            marker="o", lw=2.0,
            color=colors[bk], label=f"buffer_kind = {bk}",
        )
    ax.set_xscale("log", base=2)
    ax.set_xlabel("Bytes per PE (log scale)")
    ax.set_ylabel("Time (ns)")
    ax.set_title(
        "Allreduce torus_2d (6 SIPs, 3×2) — IPCQ slot memory tier"
    )
    ax.grid(True, alpha=0.3)
    ax.legend()
    ax.xaxis.set_major_formatter(_bytes_fmt)
    fig.tight_layout()
    fig.savefig(_OUT_DIR / "buffer_kind_sweep.png", dpi=130)
    plt.close(fig)
    for p in row_files:
        try:
            p.unlink()
        except OSError:
            pass
    try:
        _ROWS_DIR.rmdir()
    except OSError:
        pass
    print(f"\nWrote {_OUT_DIR / 'buffer_kind_sweep.png'} "
          f"from {len(records)} rows")
    return True
@@ -0,0 +1,95 @@
 """Tests for kernbench.benches.registry — @bench decorator + resolve/list."""
 from __future__ import annotations
 import pytest
 from kernbench.benches import registry
 EXPECTED_NAMES = [
    "ccl-allreduce",
    "gemm-single-pe",
    "gpt3-qkv",
    "ipcq-allreduce",
    "matmul-composite",
    "qkv-gemm",
    "qkv-gemm-multi-pe",
    "va-offset-verify",
 ]
 def test_registry_lists_all_benches():
    specs = registry.list_all()
    names = [s.name for s in specs]
    assert names == EXPECTED_NAMES
 def test_registry_indices_are_1_based_sorted_by_name():
    specs = registry.list_all()
    assert [s.index for s in specs] == list(range(1, len(EXPECTED_NAMES) + 1))
    assert sorted(s.name for s in specs) == [s.name for s in specs]
 def test_resolve_by_name_returns_spec():
    spec = registry.resolve("gemm-single-pe")
    assert spec.name == "gemm-single-pe"
    assert callable(spec.run)
    assert spec.description.strip()
 def test_resolve_by_index_string_matches_list_order():
    specs = registry.list_all()
    third = specs[2]
    resolved = registry.resolve(str(third.index))
    assert resolved is third
 def test_resolve_unknown_name_raises():
    with pytest.raises(ValueError, match="kernbench list"):
        registry.resolve("does-not-exist")
 def test_resolve_unknown_index_raises():
    with pytest.raises(ValueError, match="kernbench list"):
        registry.resolve("99")
 def test_resolve_empty_identifier_raises():
    with pytest.raises(ValueError):
        registry.resolve("")
 def test_bench_decorator_rejects_invalid_name():
    with pytest.raises(ValueError, match="kebab-case"):
        registry.bench(name="Invalid_Name", description="x")
 def test_bench_decorator_rejects_empty_description():
    with pytest.raises(ValueError, match="non-empty"):
        registry.bench(name="ok-name", description="   ")
 def test_audit_raises_on_missing_decorator():
    with pytest.raises(RuntimeError, match="missing @bench decorator"):
        registry._audit_modules(
            imported=["kernbench.benches.fake_no_dec", "kernbench.benches.real"],
            registered={"kernbench.benches.real"},
        )
 def test_audit_passes_when_all_registered():
    registry._audit_modules(
        imported=["kernbench.benches.a", "kernbench.benches.b"],
        registered={"kernbench.benches.a", "kernbench.benches.b"},
    )
 def test_duplicate_name_at_finalize_fails(monkeypatch):
    """_finalize() rejects two pending entries with the same name."""
    monkeypatch.setattr(registry, "_PENDING", [
        ("dup", "d1", lambda: None),
        ("dup", "d2", lambda: None),
    ])
    monkeypatch.setattr(registry, "_REGISTRY", {})
    with pytest.raises(RuntimeError, match="duplicate bench name"):
        registry._finalize()
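The contract these registry tests pin down (kebab-case names, non-empty descriptions, name-sorted 1-based indices, lookup by name or by index string) can be sketched in a few lines. This is a minimal illustration, not the `kernbench.benches.registry` implementation: the `Registry` class, the `BenchSpec` layout, the regex, and the demo descriptions are all assumptions, and the real module's `_PENDING` / `_finalize()` / `_audit_modules` machinery is left out.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumed kebab-case rule: lowercase words joined by single hyphens,
# starting with a letter so a name can never collide with a numeric index.
_KEBAB = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


@dataclass(frozen=True)
class BenchSpec:
    index: int          # 1-based position in name-sorted order
    name: str
    description: str
    run: Callable


class Registry:
    """Hypothetical stand-in for a bench registry."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, Callable]] = {}

    def bench(self, *, name: str, description: str):
        # Validate eagerly, so a bad decorator call fails before decorating.
        if not _KEBAB.match(name):
            raise ValueError(f"bench name {name!r} must be kebab-case")
        if not description.strip():
            raise ValueError("bench description must be non-empty")

        def decorate(fn: Callable) -> Callable:
            if name in self._entries:
                raise RuntimeError(f"duplicate bench name: {name}")
            self._entries[name] = (description, fn)
            return fn

        return decorate

    def list_all(self) -> List[BenchSpec]:
        # Indices are assigned from the sorted names, starting at 1.
        return [
            BenchSpec(i, n, self._entries[n][0], self._entries[n][1])
            for i, n in enumerate(sorted(self._entries), start=1)
        ]

    def resolve(self, ident: str) -> BenchSpec:
        for spec in self.list_all():
            if ident == spec.name or ident == str(spec.index):
                return spec
        raise ValueError(f"unknown bench {ident!r}; see `kernbench list`")


# Demo registrations; the descriptions are placeholders.
demo = Registry()


@demo.bench(name="qkv-gemm", description="QKV GEMM bench")
def run_qkv(torch):
    return "qkv"


@demo.bench(name="ccl-allreduce", description="CCL all-reduce bench")
def run_ccl(torch):
    return "ccl"
```

Here `demo.resolve("qkv-gemm")` and `demo.resolve("2")` describe the same bench, because `ccl-allreduce` sorts first. Unlike the registry under test, which is expected to hand back the same spec object from `resolve` and `list_all`, this sketch rebuilds specs on every call, so compare them by equality rather than identity.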
@@ -6,17 +6,17 @@ def test_cli_main_arg_parsing(monkeypatch):
    def fake_cmd_run(args) -> int:
        assert args.cmd == "run"
        assert args.topology == "topology.yaml"
-        assert args.bench == "qkv_gemm"
+        assert args.bench == "qkv-gemm"
        assert args.device == None
        return 0
    # monkey patch the handler to test arg parsing without running the actual bench
    monkeypatch.setattr(cli_main, "cmd_run", fake_cmd_run)
-    rc = cli_main.main(["run", "--topology", "topology.yaml", "--bench", "qkv_gemm"])
+    rc = cli_main.main(["run", "--topology", "topology.yaml", "--bench", "qkv-gemm"])
    assert rc == 0
 def test_cli_main():
    """CLI bench run on single SIP device."""
-    rc = cli_main.main(["run", "--topology", "topology.yaml", "--bench", "qkv_gemm", "--device", "sip:0"])
+    rc = cli_main.main(["run", "--topology", "topology.yaml", "--bench", "qkv-gemm", "--device", "sip:0"])
    assert rc == 0
@@ -0,0 +1,44 @@
 """Tests for `kernbench list` subcommand and `--bench <index>` resolution."""
 from __future__ import annotations
 import kernbench.cli.main as cli_main
 from kernbench.benches import registry
 def test_cli_list_outputs_all_benches(capsys):
    rc = cli_main.main(["list"])
    assert rc == 0
    out = capsys.readouterr().out
    for spec in registry.list_all():
        assert spec.name in out
    assert "DESCRIPTION" in out
 def test_cli_run_by_index(monkeypatch):
    """CLI accepts numeric index for --bench; same callable as the name."""
    qkv_spec = registry.resolve("qkv-gemm")
    captured = {}
    def fake_run_bench(*, topology, bench_fn, device, engine_factory):
        captured["bench_fn"] = bench_fn
        class _R:
            traces = []
            engine = None
            class completion:
                ok = True
            def summary_text(self):
                return ""
        return _R()
    monkeypatch.setattr(cli_main, "run_bench", fake_run_bench)
    rc = cli_main.main([
        "run", "--topology", "topology.yaml",
        "--bench", str(qkv_spec.index),
        "--device", "sip:0",
    ])
    assert rc == 0
    assert captured["bench_fn"] is qkv_spec.run
@@ -11,7 +11,7 @@ def test_cli_verify_data_flag_parsed(monkeypatch):
    monkeypatch.setattr(cli_main, "cmd_run", fake_cmd_run)
    rc = cli_main.main([
-        "run", "--topology", "topology.yaml", "--bench", "qkv_gemm",
+        "run", "--topology", "topology.yaml", "--bench", "qkv-gemm",
        "--verify-data",
    ])
    assert rc == 0
@@ -26,7 +26,7 @@ def test_cli_verify_data_flag_default(monkeypatch):
    monkeypatch.setattr(cli_main, "cmd_run", fake_cmd_run)
    rc = cli_main.main([
-        "run", "--topology", "topology.yaml", "--bench", "qkv_gemm",
+        "run", "--topology", "topology.yaml", "--bench", "qkv-gemm",
    ])
    assert rc == 0
@@ -34,7 +34,7 @@ def test_cli_verify_data_flag_default(monkeypatch):
 def test_cmd_run_verify_data_enables_engine():
    """--verify-data runs full pipeline with enable_data=True and DataExecutor."""
    rc = cli_main.main([
-        "run", "--topology", "topology.yaml", "--bench", "qkv_gemm",
+        "run", "--topology", "topology.yaml", "--bench", "qkv-gemm",
        "--device", "sip:0", "--verify-data",
    ])
    assert rc == 0
@@ -43,7 +43,7 @@ def test_cmd_run_verify_data_enables_engine():
 def test_cmd_run_without_verify_data_no_op_log():
    """Without --verify-data, engine runs in timing-only mode (no op_log)."""
    rc = cli_main.main([
-        "run", "--topology", "topology.yaml", "--bench", "qkv_gemm",
+        "run", "--topology", "topology.yaml", "--bench", "qkv-gemm",
        "--device", "sip:0",
    ])
    assert rc == 0
@@ -1,119 +0,0 @@
 """End-to-end distributed test for intercube allreduce.
 Exercises the full process-group path:
    dist.init_process_group(backend="ahbm")
    → mp.spawn(nprocs=n_sips)
    → each worker: set_device → allocate → fill → dist.all_reduce → verify
 This is the same flow a real DDP training script would use.
 """
 from __future__ import annotations
 import os
 import textwrap
 from pathlib import Path
 import numpy as np
 import pytest
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
 N_CUBES = 16
 N_ELEM = 8
 def _write_ccl_yaml(tmp_path) -> str:
    body = textwrap.dedent("""\
        defaults:
          algorithm: intercube_allreduce
          buffer_kind: tcm
          backpressure: sleep
          n_slots: 4
          slot_size: 4096
          vc_chunk_size: 256
          ipcq_credit_size_bytes: 16
        algorithms:
          intercube_allreduce:
            module: kernbench.ccl.algorithms.intercube_allreduce
            topology: none
            buffer_kind: tcm
            n_elem: 8
            root_cube: 15
    """)
    (tmp_path / "ccl.yaml").write_text(body)
    return str(tmp_path)
 def _worker(rank: int, n_sips: int, torch) -> None:
    """Per-SIP worker: allocate, fill, all_reduce, verify."""
    from kernbench.policy.placement.dp import DPPolicy
    torch.ahbm.set_device(rank)
    dp = DPPolicy(
        cube="row_wise", pe="replicate",
        num_pes=1, num_cubes=N_CUBES,
    )
    tensor = torch.zeros(
        (N_CUBES, N_ELEM), dtype="f16", dp=dp,
        name=f"sip{rank}",
    )
    init_arr = np.full((N_CUBES, N_ELEM), float(rank + 1), dtype=np.float16)
    tensor.copy_(torch.from_numpy(init_arr))
    print(f"[SIP {rank}] input  cube0[:4] = {tensor.numpy()[0][:4].tolist()}")
    torch.distributed.all_reduce(tensor, op="sum")
    arr = tensor.numpy()
    expected = float(N_CUBES * sum(range(1, n_sips + 1)))
    print(f"[SIP {rank}] output cube0[:4]  = {arr[0][:4].tolist()}")
    print(f"[SIP {rank}] output cube15[:4] = {arr[15][:4].tolist()}")
    for cube_id in range(N_CUBES):
        assert np.allclose(arr[cube_id], expected, rtol=1e-1, atol=1e-1), (
            f"SIP{rank} cube {cube_id}: "
            f"got {arr[cube_id][:4]}, expected {expected}"
        )
    if rank == 0:
        print(f"\n  intercube_allreduce (ws={n_sips}): "
              f"{n_sips * N_CUBES} OK")
 def test_distributed_intercube_allreduce(tmp_path, monkeypatch):
    """Full distributed path: init_process_group → mp.spawn → all_reduce."""
    from kernbench.runtime_api.context import RuntimeContext
    from kernbench.runtime_api.types import DeviceSelector
    from kernbench.sim_engine.engine import GraphEngine
    from kernbench.topology.builder import resolve_topology
    monkeypatch.chdir(_write_ccl_yaml(tmp_path))
    topo = resolve_topology(str(TOPOLOGY_PATH))
    engine = GraphEngine(topo.topology_obj, enable_data=True)
    spec = topo.topology_obj.spec
    n_sips = int(spec["system"]["sips"]["count"])
    with RuntimeContext(
        engine=engine,
        target_device=DeviceSelector("all"),
        correlation_id="dist_intercube_ar",
        spec=spec,
    ) as ctx:
        ctx.distributed.init_process_group(backend="ahbm")
        assert ctx.distributed.get_world_size() == n_sips
        t_start = engine._env.now
        ctx.multiprocessing.spawn(
            _worker, args=(n_sips, ctx), nprocs=n_sips,
        )
        t_end = engine._env.now
        print(f"\n[distributed] sim latency = "
              f"{t_end - t_start:.1f} ns ({(t_end - t_start) / 1000:.3f} us)")
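The `expected` value asserted in `_worker` follows from the fill pattern: rank `r` writes `r + 1` into every element of every cube, and the assertion implies the reduction sums over cubes as well as over SIPs. The NumPy sketch below reproduces that arithmetic on the host. It does not model the simulator or the `ahbm` backend; `simulate_allreduce` and `expected_value` are hypothetical helpers, and the SIP count of 6 used in the usage note is only illustrative (the test reads it from `topology.yaml`).

```python
import numpy as np

N_CUBES, N_ELEM = 16, 8  # same constants as the test above


def simulate_allreduce(n_sips: int) -> np.ndarray:
    """Host-side reference: sum over all cubes of all SIPs, then broadcast."""
    # Rank r fills its (N_CUBES, N_ELEM) tensor with r + 1, as _worker does.
    inputs = [
        np.full((N_CUBES, N_ELEM), float(rank + 1), dtype=np.float16)
        for rank in range(n_sips)
    ]
    # Accumulate in float32 (an assumption of this sketch), reducing over
    # the cube axis within each SIP and then across SIPs.
    total = sum(x.astype(np.float32).sum(axis=0) for x in inputs)
    # Every cube ends up holding the same reduced vector.
    return np.broadcast_to(total, (N_CUBES, N_ELEM)).astype(np.float16)


def expected_value(n_sips: int) -> float:
    """Closed form used by the test: N_CUBES * (1 + 2 + ... + n_sips)."""
    return float(N_CUBES * sum(range(1, n_sips + 1)))
```

For 6 SIPs, `expected_value(6)` is `16 * 21 = 336.0`, which float16 represents exactly, so `np.allclose(simulate_allreduce(6), expected_value(6))` holds without needing the test's loose `rtol=1e-1` tolerance.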
@@ -28,7 +28,7 @@ def _engine_and_spec():
 def _merged_cfg():
    cfg = load_ccl_config()
-    return resolve_algorithm_config(cfg, name="intercube_allreduce")
+    return resolve_algorithm_config(cfg, name="lrab_hierarchical_allreduce")
 class TestConfigureSfrNeighborTables:
@@ -20,7 +20,7 @@ Reference (Phase 2 will edit these):
  - ccl.yaml                                    — algorithm.buffer_kind
 The tests reuse the existing config-driven allreduce app
-(``run_allreduce`` in tests/test_allreduce_multidevice.py) with a 2-SIP
+(``run_allreduce`` in tests/sccl/_allreduce_helpers.py) with a 2-SIP
 ring topology and a SMALL n_elem so they finish fast (~3-5 s each).
 """
 from __future__ import annotations
@@ -37,7 +37,7 @@ from kernbench.topology.builder import resolve_topology
 # Reuse the test app's helpers so this micro-test file does not
 # duplicate the run-allreduce + write-temp-configs plumbing.
-from tests.test_allreduce_multidevice import (
+from tests.sccl._allreduce_helpers import (
    _write_temp_configs,
    run_allreduce,
 )
@@ -81,7 +81,7 @@ def _run_torus_allreduce(
        sub,
        sip_topology="torus_2d",
        n_sips=6,
-        algorithm="intercube_allreduce",
+        algorithm="lrab_hierarchical_allreduce",
        sip_w=3, sip_h=2,
        n_elem_override=n_elem,
    )
@@ -92,7 +92,7 @@ def _run_torus_allreduce(
        ccl_cfg = yaml.safe_load(f)
    ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
    ccl_cfg.setdefault("algorithms", {}).setdefault(
-        "intercube_allreduce", {},
+        "lrab_hierarchical_allreduce", {},
    )["buffer_kind"] = buffer_kind
    with open(ccl_path, "w") as f:
        yaml.dump(ccl_cfg, f, default_flow_style=False)
@@ -109,7 +109,7 @@ def _run_torus_allreduce(
    ) as ctx:
        result = run_allreduce(
            ctx, engine, spec,
-            algorithm="intercube_allreduce", ccl_yaml=ccl_path,
+            algorithm="lrab_hierarchical_allreduce", ccl_yaml=ccl_path,
        )
        assert result["ok_cubes"] > 0, "allreduce did not validate"
@@ -47,7 +47,7 @@ from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
-from tests.test_allreduce_multidevice import (
+from tests.sccl._allreduce_helpers import (
    _write_temp_configs,
    run_allreduce,
 )
@@ -59,8 +59,9 @@ def _run_allreduce_with_buffer_kind(
    """Run one torus_2d 6-SIP allreduce with the given buffer_kind and
    return critical-path pe_exec_ns (max across all PEs).
-    Mirrors the sweep harness in test_allreduce_buffer_kind_sweep.py
-    so the assertions below compare apples-to-apples against that PNG.
+    Mirrors the buffer-kind sweep harness in
+    tests/sccl/test_plot_buffer_kind_sweep.py so the assertions
+    below compare apples-to-apples against that PNG.
    """
    sub = tmp_path / f"{buffer_kind}_{n_elem}"
    sub.mkdir()
@@ -68,7 +69,7 @@ def _run_allreduce_with_buffer_kind(
        sub,
        sip_topology="torus_2d",
        n_sips=6,
-        algorithm="intercube_allreduce",
+        algorithm="lrab_hierarchical_allreduce",
        sip_w=3, sip_h=2,
        n_elem_override=n_elem,
    )
@@ -77,7 +78,7 @@ def _run_allreduce_with_buffer_kind(
        ccl_cfg = yaml.safe_load(f)
    ccl_cfg.setdefault("defaults", {})["buffer_kind"] = buffer_kind
    ccl_cfg.setdefault("algorithms", {}).setdefault(
-        "intercube_allreduce", {},
+        "lrab_hierarchical_allreduce", {},
    )["buffer_kind"] = buffer_kind
    with open(ccl_path, "w") as f:
        yaml.dump(ccl_cfg, f, default_flow_style=False)
@@ -94,7 +95,7 @@ def _run_allreduce_with_buffer_kind(
    ) as ctx:
        result = run_allreduce(
            ctx, engine, spec,
-            algorithm="intercube_allreduce", ccl_yaml=ccl_path,
+            algorithm="lrab_hierarchical_allreduce", ccl_yaml=ccl_path,
        )
        assert result["ok_cubes"] > 0, "allreduce did not validate"
@@ -235,7 +235,7 @@ def test_qkv_gemm_still_passes():
        correlation_id="test_regression",
        spec=graph.spec,
    )
-    from benches.qkv_gemm import run as bench_run
+    from kernbench.benches.qkv_gemm import run as bench_run
    bench_run(ctx)
    ctx.wait_all()
    # If we get here without exception, the benchmark succeeded
@@ -864,7 +864,7 @@ def test_mcpu_kernel_launch_composite():
 def test_qkv_gemm_bench_completes():
    """The qkv_gemm benchmark runs to completion without error."""
    clear_registry()
-    from benches.qkv_gemm import run as bench_run
+    from kernbench.benches.qkv_gemm import run as bench_run
    from kernbench.runtime_api.context import RuntimeContext
    graph = load_topology(TOPOLOGY_PATH)
@@ -958,7 +958,7 @@ def test_mcpu_multi_pe_kernel_launch():
 def test_qkv_gemm_bench_multi_pe_completes():
    """The qkv_gemm_multi_pe benchmark runs to completion without error."""
    clear_registry()
-    from benches.qkv_gemm_multi_pe import run as bench_run
+    from kernbench.benches.qkv_gemm_multi_pe import run as bench_run
    from kernbench.runtime_api.context import RuntimeContext
    graph = load_topology(TOPOLOGY_PATH)
@@ -472,7 +472,7 @@ def _run_ipcq():
        dst_sip, dst_cube, dst_pe = DST
        cfg = load_ccl_config()
-        merged = resolve_algorithm_config(cfg, name="intercube_allreduce")
+        merged = resolve_algorithm_config(cfg, name="lrab_hierarchical_allreduce")
        merged["slot_size"] = max(int(merged.get("slot_size", 4096)), NBYTES)
        with RuntimeContext(
@@ -56,13 +56,17 @@ class Hop:
 HOPS = [
-    Hop("h1_intra_horizontal", "Intra-cube horizontal (pe0 to pe1)",
+    Hop("latency_intracube_PE0_to_PE1_horizontal",
+        "Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal)",
        (0, 0, 0), (0, 0, 1), "intra_E", "intra_W", True),
-    Hop("h2_intra_vertical", "Intra-cube vertical (pe0 to pe4)",
+    Hop("latency_intracube_PE0_to_PE4_vertical",
+        "Intra-cube PE-to-PE latency: PE0 → PE4 (vertical)",
        (0, 0, 0), (0, 0, 4), "intra_S", "intra_N", True),
-    Hop("h3_inter_cube_horizontal", "Inter-cube horizontal (cube0 to cube1)",
+    Hop("latency_intercube_C0PE0_to_C1PE0_horizontal",
+        "Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal)",
        (0, 0, 0), (0, 1, 0), "E", "W", True),
-    Hop("h4_inter_cube_vertical", "Inter-cube vertical (cube0 to cube4)",
+    Hop("latency_intercube_C0PE0_to_C4PE0_vertical",
+        "Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical)",
        (0, 0, 0), (0, 4, 0), "S", "N", True),
 ]
@@ -80,7 +84,7 @@ def _measure_ipcq(hop: Hop, nbytes: int) -> float:
    engine, spec = _make_engine()
    cfg = load_ccl_config()
-    merged = resolve_algorithm_config(cfg, name="intercube_allreduce")
+    merged = resolve_algorithm_config(cfg, name="lrab_hierarchical_allreduce")
    merged["slot_size"] = max(int(merged.get("slot_size", 4096)), nbytes)
    n_elem = nbytes // ELEM_BYTES
@@ -263,7 +263,7 @@ def test_pe_cross_cube_best_worst():
 def test_probe_timestamp_trace():
    """_hop_timestamps must return monotonically increasing cumulative timestamps."""
-    from kernbench.cli.probe import _hop_timestamps, _build_edge_map
+    from kernbench.probes.probe import _hop_timestamps, _build_edge_map
    graph = _graph()
    edge_map = _build_edge_map(graph)
    resolver = AddressResolver(graph)
@@ -341,7 +341,7 @@ def test_hbm_efficiency_applied():
 def test_probe_sweep_saturation():
    """Utilization at 1MB must exceed utilization at 4KB for pe-local-hbm."""
-    from kernbench.cli.probe import _sweep_util
+    from kernbench.probes.probe import _sweep_util
    # pe-local-hbm: ovhd=2ns (router), wire~0.03ns, bn from topology
    bn = _hbm_effective_bw()
    u = _sweep_util(2.0, 0.03, bn)
Author	SHA1	Message	Date
mukesh	fd56b6cacd	adr: add ADR-0043/0044 (eval harnesses); reconcile ADR-0024/0032 for SIP w/h Document the allreduce + GEMM evaluation harnesses and bring the affected allreduce ADRs in line with the refactored code. New (Accepted, EN + KO): - ADR-0043 — allreduce evaluation harness (tests/sccl/): distributed-driven correctness, latency/buffer-kind sweeps, sessionfinish plot aggregators, topology + FSIM-comparison figures. Verified against the implementation. - ADR-0044 — GEMM evaluation harness (scripts/gemm_sweep.py + tests/gemm/): heavy-script data gen vs. fast test-rendered figures, slow regenerator, the 3-figure set. Records two limitations as open questions: the theoretical-model constants are inherited (not yet traced to ADR-0033/ 0014), and the *_measured figure is a naming misnomer. Updated (EN + KO): - ADR-0024 — add D5: SIP grid w/h resolution (explicit sips.w/h, square fallback, fail-loud), documenting the AhbmCCLBackend fix. - ADR-0032 — D4/D5/Non-goals reconciled: rectangular SIP grids (e.g. 6 SIPs as 3x2) are supported via explicit w/h; the square requirement now applies only to the fallback. Affected-files repointed to tests/sccl/. Verification: ADR-0023 and ADR-0042 confirmed still matching the code (no change). verify_adr_lang_pairs.py passes (EN/KO Status blocks byte-equal). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 10:26:25 -07:00
mukesh	0e346b939d	gemm: test-generated GEMM plots under tests/gemm/ + docs/diagrams/gemm_plots/ Mirror the sccl pattern for GEMM figures: a tests/gemm/ package renders the GEMM bar charts as PNGs from the committed docs/diagrams/gemm_sweep.json, so the figures are fast test artifacts (run by default) while the heavy sim sweep stays a manual script (scripts/gemm_sweep.py, kept) wrapped by a slow regenerator test. tests/gemm/: - _gemm_plot_helpers.py: matplotlib renderers (series logic mirrors the GEMM _render_* functions in scripts/build_overview_slides.py). - test_plot_gemm_stage_breakdown.py: gemm_stage_breakdown.png (load_ref). - test_plot_gemm_mac_utilization.py: gemm_mac_utilization_measured.png + gemm_mac_utilization_theoretical_vs_measured.png (load_ref). - test_gemm_sweep.py: @pytest.mark.slow regenerator (runs scripts/gemm_sweep.py). Chart set trimmed to three (stage breakdown, MAC util, theoretical-vs-measured); "formula" relabeled to "theoretical" throughout the comparison chart. Known follow-ups (not blocking): - gemm_mac_utilization_measured.png currently plots the theoretical ideal- pipeline model, not simulator-measured data; the name is a misnomer pending a decision to repoint its content or retitle. - The theoretical-model constants (HBM 256 GB/s, T_stage 16 ns, 3 stages) are inherited verbatim from build_overview_slides.py and not yet verified against ADR-0033 / ADR-0014 / topology. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 09:58:08 -07:00
mukesh	b610cb0d9a	sccl: drive allreduce tests via torch.distributed; reorganize into tests/sccl/ Convert the multidevice allreduce correctness + latency/buffer-kind sweeps to run through the real PyTorch-distributed path (init_process_group(backend="ahbm") -> mp.spawn -> dist.all_reduce) instead of direct ctx.launch, and reorganize the CCL/allreduce tests into a tests/sccl/ package split one test per file. Production change (required for the distributed path on non-square SIP grids): - AhbmCCLBackend now reads explicit system.sips.w/h from the spec, with a square-only sqrt fallback that raises on ambiguity, instead of silently guessing round(sqrt(count)). This fixes the 2x3 / 3x2 torus + mesh cases, which previously resolved to a wrong 2x2 grid. Mirrors the test helper's _sip_topo_dims precedence (explicit w/h > square fallback > raise). Test reorganization (tests/sccl/): - _allreduce_helpers.py: shared plumbing (distributed driver, config writers, direct-launch run_allreduce parity reference, sweep/buffer-kind constants, plot aggregators, topology-diagram + FSIM-comparison emitters). - test_allreduce_ring_torus_mesh.py: correctness across ring/torus/mesh. - test_distributed_default_topology.py: full distributed path on topology.yaml. - test_plot_latency_sweep.py / test_plot_buffer_kind_sweep.py: sweep rows. - test_plot_topology_diagram.py / test_plot_comparison_fsim.py: plot emitters. - test_intercube_root_center.py: moved in (ADR-0032 center-root latency guard). Also: - Move the FSIM comparison plot generator out of scripts/ into the sccl suite. - Delete superseded test files (test_allreduce_multidevice, test_distributed_lrab_hierarchical_allreduce, test_allreduce_buffer_kind_sweep) and repoint conftest aggregators + the ipcq buffer-kind importers. - Regenerate the allreduce_latency_plots derived artifacts from the full sweep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 22:24:43 -07:00
mukesh	ff7d727ddd	CCL allreduce: rename to lrab_hierarchical_allreduce + descriptive plots Rename the intercube all-reduce identity to lrab_hierarchical_allreduce (module, config key, distributed test) so the name reflects both levels it implements: LRAB intra-SIP (local reduce to center root + broadcast) and the hierarchical inter-SIP topology exchange (ring/torus/mesh). ADR-0032 slug kept as the stable decision id; pure rename, no logic change. Also in this batch: - ADR-0032 (EN+KO): document the shipped center-root bidirectional reduce (doc was stale corner-root); annotate ccl.yaml root_cube as a placeholder. - Rename allreduce + pe2pe latency plots to descriptive, title-matching filenames and retitle the in-plot headings; drop overview/overview_log. - Point the PPTX image refs at the new plot names. Doc + derived-artifact + rename only; no simulation behavior changed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 20:50:48 -07:00
ywkang	e77e4a1703	types: narrow BenchResult.engine to GraphEngine, cast topology in engine_factory Replace BenchResult.engine: object \| None with GraphEngine \| None via TYPE_CHECKING import (avoids circular import at runtime). Cast the topology argument to TopologyGraph at the GraphEngine call site for the duck-typed engine_factory. Fixes Pylance reportAttributeAccessIssue warnings on op_log and topology arg. Type annotations only; no runtime behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 14:54:18 -07:00
ywkang	1f36baa898	ADR: add 0038-0042 (pcie_ep, pe_mmu, pe_tcm, sram, tiling) Fill component-model coverage gaps surfaced by /report's G4 analysis. Each ADR documents the component's First action, latency model, and honest notes on dormant code or implementation asymmetries discovered during re-evaluation against current code. - 0038 pcie_ep: thin protocol-overhead model; ComponentBase forwarding worker as-is; named-node contract for router helpers - 0039 pe_mmu: component + utility dual role; sub-page region stopgap; D2.1 flags pipeline path missing mmu.overhead_ns timeout (asymmetric with non-pipeline; not visible at default tlb_overhead_ns=0) - 0040 pe_tcm: dual-channel BW serialization (read/write Resource cap=1); TcmRequest schema owned by TCM; timing-only (no data store) - 0041 sram: terminal scratchpad model + ResponseMsg on reverse path; D1.1 flags _worker override as currently dormant (no Transaction actually targets the SRAM node today) - 0042 tiling: pure plan-generator module, not a component; corrects the G4 misclassification; pins GEMM/Math stage sequences and epilogue scope contract Also: /report skill G3 refinement — only flag older->newer asymmetric cross-references; newer->older (e.g., 0034-0037 citing infrastructure ADRs) are expected one-way and no longer reported. Bilingual pair verifier (tools/verify_adr_lang_pairs.py) passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 14:43:03 -07:00
ywkang	049e3d8bb3	benches: package as kernbench.benches, add @bench registry + list subcommand Move benches/ -> src/kernbench/benches/ and src/kernbench/cli/probe.py -> src/kernbench/probes/probe.py. Each bench self-registers via @bench(name=..., description=...); kernbench list enumerates benches with auto-assigned indices, --bench accepts kebab-case name or numeric index. Audit at package-import time fails if any non-underscore module forgets the decorator. ADR-0010 (EN + KO) updated to reflect the new resolver path, list subcommand, and probes package separation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 14:42:10 -07:00
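The SIP grid precedence described in commit b610cb0d9a (explicit `system.sips.w/h`, then a square-only fallback, then a loud failure) can be sketched as follows. This is an illustration under assumptions, not the `AhbmCCLBackend` or `_sip_topo_dims` code: the function name, the error messages, and the `w * h == count` consistency check are hypothetical, and only the `spec["system"]["sips"]` layout is taken from the tests above.

```python
import math
from typing import Tuple


def resolve_sip_grid(spec: dict) -> Tuple[int, int]:
    """Return (w, h) for the SIP grid: explicit w/h > square fallback > raise."""
    sips = spec["system"]["sips"]
    count = int(sips["count"])
    w, h = sips.get("w"), sips.get("h")

    # 1. Explicit dimensions win, which is what makes 3x2 / 2x3 grids work.
    if w is not None and h is not None:
        w, h = int(w), int(h)
        if w * h != count:  # assumed sanity check, not confirmed by the source
            raise ValueError(f"sips.w*h = {w}x{h} does not match count={count}")
        return w, h

    # 2. Square-only fallback: accept count only if it is a perfect square.
    side = math.isqrt(count)
    if side * side == count:
        return side, side

    # 3. Ambiguous: fail loudly rather than guess round(sqrt(count)).
    raise ValueError(
        f"cannot infer SIP grid for count={count}; set system.sips.w and h"
    )
```

With 6 SIPs and no explicit dimensions, `round(math.sqrt(6))` is 2, which is how the old guess arrived at the wrong 2x2 grid; the sketch raises in that case and returns `(3, 2)` only when `w: 3, h: 2` are given.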
		`@@ -1,2 +0,0 @@`
			`def run(torch):`
			`print("IPCQ all reduce kernel bench")`