attention: milestone-gqa-llama70b figures + MILESTONE_FAST (sub-cycle 4c, 5/6)

Add 5 of the 6 figure renderers ADR-0057 D3 sub-cycle 4c specifies:
- gqa_op_log_{panel}.png × 4: per-panel bar chart of the 5 op_log counts
  (gemm, ipcq_send, ipcq_recv, dma_read, dma_write).
- gqa_comparison.png: cross-panel grouped bars over the same 5 series.

Sixth figure (gqa_scaling.png) depends on sub-cycle 4b's Q/cube ∈ {1, 2, 4}
sweep on multi_user_* panels and is deferred until that data exists;
emit_all_gqa_plots returns just the 5 in-scope paths.

Add MILESTONE_FAST=1 mode to run(): skip the panel sweep, reuse the committed
sweep.json, render figures only. Validation mode unchanged. The runtime errors
clearly when neither env var is set, listing the two supported modes.

Renderers live in the bench module (the milestone-1h-gemm pattern);
tests/gqa/_gqa_plot_helpers.py re-exports them for figure tests.

Tests: tests/gqa/test_plot_gqa_figures.py, 7 tests, all green:
- 4 parametrized per-panel emit assertions
- 1 comparison emit assertion
- 1 emit_all returns exactly 5 PNG paths
- 1 default out_dir matches the bench _OUTPUT_DIR

Commits the 5 PNG baselines under the bench output dir alongside sweep.json,
mirroring milestone-1h-gemm's committed-figures pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
attention: land milestone-gqa-llama70b 4-panel sweep bench (ADR-0057 v1)
2026-06-01 22:23:28 -07:00 · 2026-06-01 21:57:12 -07:00 · 2026-06-01 19:53:18 -07:00 · 2026-06-01 19:33:40 -07:00 · 2026-06-01 19:14:09 -07:00 · 2026-05-22 15:37:27 -07:00
160 changed files with 15572 additions and 1398 deletions
@@ -62,6 +62,10 @@ After writing the document, report to the user in the chat response:
 - **G2 gaps** — ADRs missing **Context** or **Decision**. Alternatives
  and Consequences are optional; their absence is NOT a gap.
 - **G3 gaps** — ADR cross-references without a back-reference.
+  Only flag when the referencer's ADR number is **less than** the
+  referenced ADR's number (older → newer). Newer ADRs citing older
+  infrastructure ADRs (higher number → lower number) are expected to
+  be one-way and are NOT flagged.
 - **G4 suggestions** — areas where an ADR seems missing based on the
  ADR corpus + SPEC reading. Phrase as suggestions, not findings. Each
  G4 item must say *why* it's suggested and remain falsifiable.
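Reviewer note: the direction-aware G3 rule in this hunk can be sketched as a small filter. This is only an illustration under stated assumptions: `g3_gaps` and the shape of `refs` (ADR number mapped to the set of ADR numbers it cites) are hypothetical and not part of the skill document.

```python
def g3_gaps(refs: dict[int, set[int]]) -> list[tuple[int, int]]:
    """Return (referencer, referenced) pairs that G3 would flag.

    Hypothetical helper: `refs` maps an ADR number to the ADR numbers it cites.
    A pair is flagged only when an older ADR cites a newer one and the newer
    ADR carries no back-reference.
    """
    gaps = []
    for referencer, cited in sorted(refs.items()):
        for referenced in sorted(cited):
            older_to_newer = referencer < referenced
            has_back_ref = referencer in refs.get(referenced, set())
            if older_to_newer and not has_back_ref:
                gaps.append((referencer, referenced))
    return gaps


example = {24: {32}, 32: {24}, 38: {33, 36}, 11: {39}, 39: set()}
print(g3_gaps(example))  # [(11, 39)]
```

ADR-0038 citing ADR-0033 and ADR-0036 is newer to older, so it is not flagged even without back-links; ADR-0011 citing ADR-0039 is older to newer with no back-link, so it is.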
@@ -99,7 +103,10 @@ For each `docs/adr/ADR-NNNN-*.md`:
 - Record presence/absence of **Context** and **Decision** for G2.
  Alternatives and Consequences presence is recorded for use during
  authoring, but their absence is not a gap.
-- Record ADR-NNNN cross-references for G3.
+- Record ADR-NNNN cross-references for G3, preserving the direction
+  (referencer → referenced). G3 evaluation uses ADR numbers to
+  distinguish older→newer (flagged when missing back-link) from
+  newer→older (not flagged; see *Output Contract* G3).
 - Record Status (e.g., Accepted, Superseded, Draft) and any "supersedes
  ADR-NNNN" text in the body for G5a.

@@ -263,9 +270,11 @@ In **dry-run mode**, replace the `Wrote:` line with:
 - ADR-NNNN: missing <Context|Decision>
 - (or "none")

-**G3 — Broken cross-references**
-- ADR-NNNN cites ADR-MMMM; ADR-MMMM does not back-reference
+**G3 — Broken cross-references** (older → newer only)
+- ADR-NNNN cites ADR-MMMM (NNNN < MMMM); ADR-MMMM does not back-reference
 - (or "none")
+- Note: newer ADRs citing older infrastructure ADRs (NNNN > MMMM) are
+  not flagged here — one-way references are the expected pattern.

 **G4 — Suggested topics that may warrant a new ADR (verify before acting)**
 - <topic>: <why agent thinks it may be missing — must be falsifiable>
@@ -371,6 +371,13 @@ Concrete forms that Part 1's *Verification Plan* MUST take in this repo:
 - `kernbench run --device <id>` runs the benchmark on a single device.
 - Omitting `--device` runs the benchmark on all devices discovered in the topology (logically parallel).
 - Device enumeration is handled by the CLI only; benchmarks MUST remain single-device.
+- **Eval-bench exception (ADR-0054)**: a *milestone / eval bench*
+  (`milestone-1h-*`) may drive many configurations and build its own
+  per-config engines to regenerate a domain's full result + figure set; it
+  ignores `--device` and submits a sentinel tensor to satisfy the
+  "must submit ≥1 request" contract (ADR-0045 D4). This is the eval-harness
+  carve-out to the single-device rule, alongside the ADR-0024 multi-SIP CCL
+  exception.

 ## Derived Artifacts (Clarification)

@@ -1,2 +0,0 @@
-def run(torch):
-    print("IPCQ all reduce kernel bench")
@@ -1,40 +0,0 @@
-from __future__ import annotations
-
-import importlib
-from collections.abc import Callable
-from typing import Any
-
-from kernbench.runtime_api.context import RuntimeContext
-
-BenchFn = Callable[[RuntimeContext], Any]
-
-
-def _load_module(bench_id: str):
-    bench_id = bench_id.strip()
-    if not bench_id:
-        raise ValueError("Bench id is empty.")
-    module_path = f"benches.{bench_id}"
-    try:
-        return importlib.import_module(module_path)
-    except ModuleNotFoundError as e:
-        raise ValueError(
-            f"Unknown bench '{bench_id}'. Expected module {module_path}.py"
-        ) from e
-
-
-def resolve_bench(bench_id: str) -> BenchFn:
-    """Resolve a bench id into its ``run(torch)`` callable.
-
-    Expected layout (repo root):
-        benches/<bench_id>.py
-            def run(torch: RuntimeContext) -> Any
-    """
-    mod = _load_module(bench_id)
-    run_fn = getattr(mod, "run", None)
-    if run_fn is None:
-        raise ValueError(
-            f"Bench module benches.{bench_id} must define 'run(torch)'."
-        )
-    if not callable(run_fn):
-        raise ValueError(f"'run' in benches.{bench_id} is not callable.")
-    return run_fn
@@ -6,7 +6,7 @@

 defaults:
  # Algorithm to run for this benchmark execution.
-  algorithm: intercube_allreduce
+  algorithm: lrab_hierarchical_allreduce

  # IPCQ ring buffer location.
  #   tcm  — PE-local TCM (fast, small, conflicts with compute TCM access)
@@ -37,9 +37,14 @@ algorithms:
  # exchange on root cube, then broadcast back. SIP topology is read
  # from topology.yaml → system.sips.topology. Kernel auto-selects
  # ring / torus / mesh inter-SIP exchange pattern.
-  intercube_allreduce:
-    module: kernbench.ccl.algorithms.intercube_allreduce
+  lrab_hierarchical_allreduce:
+    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
+    # root_cube: the kernel currently elects the root dynamically as the
+    # geometric center of the cube mesh (root = (h//2)*w + (w//2)) to
+    # minimize the intra-SIP critical path, so this value is NOT read today.
+    # Kept as a placeholder for a future explicit-root override / runtime
+    # election hook (see ADR-0032 D1 + Non-goals).
    root_cube: 15
@@ -6,10 +6,11 @@ Accepted

 ## Context

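-The `kernbench` CLI is the simulator's user-facing entry point. It exposes three
+The `kernbench` CLI is the simulator's user-facing entry point. It exposes four
 subcommands:

 - `run`: runs a benchmark against a topology.
+- `list`: prints the list of registered benchmarks.
 - `probe`: a diagnostic utility for latency / bandwidth measurement.
 - `web`: an interactive topology viewer.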

@@ -33,8 +34,9 @@ Accepted

 - `--topology <path>`: path to the topology YAML file. Loaded via
   `resolve_topology()`.
-- `--bench <name>`: benchmark name. Resolved via
-  `benches.loader.resolve_bench()`.
+- `--bench <identifier>`: benchmark identifier. Resolved via
+  `kernbench.benches.registry.resolve()`; accepts either a registered kebab-case
+  name (e.g. `gemm-single-pe`) or a numeric index from `kernbench list`.

 Optional arguments:

@@ -61,7 +63,22 @@ Accepted
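 The CLI does **not** spawn multiple OS processes or independent simulation runs;
 parallelism happens inside a single simulation instance.

-### D4. `kernbench probe`: latency / bandwidth diagnostic utility
+### D4. `kernbench list`: print the registered benchmarks
+
+Takes no arguments. Prints each registered bench's auto-assigned index,
+registered name, and one-line description.
+
+A bench registers itself through the `@bench(name=..., description=...)`
+decorator (`kernbench.benches.registry`). Every module under
+`kernbench.benches/` whose name does not start with an underscore must
+register at least one bench; if the decorator is missing, a `RuntimeError`
+is raised when the package is imported.
+
+Indices are assigned at import time in alphabetical order of the names. An
+index is a CLI convenience, a shorthand for `--bench`, and not a stable API:
+when a new bench sorts earlier alphabetically, the later indices shift.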
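Reviewer note: a minimal sketch of why the indices are unstable, assuming a dict-backed registry. `_REGISTRY`, `indexed`, the 0-based numbering, and the bench name `ccl-allreduce` are hypothetical; only the `@bench(name=..., description=...)` shape and the alphabetical ordering come from the ADR text.

```python
_REGISTRY: dict[str, str] = {}  # hypothetical storage: name -> description


def bench(name: str, description: str):
    """Register the decorated run function under `name` (sketch only)."""
    def decorator(fn):
        if name in _REGISTRY:
            raise RuntimeError(f"duplicate bench name: {name}")
        _REGISTRY[name] = description
        return fn
    return decorator


def indexed() -> list[tuple[int, str]]:
    """Assign indices in alphabetical order of the registered names."""
    return list(enumerate(sorted(_REGISTRY)))


@bench(name="gemm-single-pe", description="single-PE GEMM")
def run_gemm(torch): ...


@bench(name="milestone-1h-gemm", description="GEMM milestone sweep")
def run_milestone(torch): ...


before = indexed()


@bench(name="ccl-allreduce", description="hypothetical new bench")
def run_ccl(torch): ...


after = indexed()
print(before)
print(after)
```

Registering a name that sorts first moves `gemm-single-pe` from index 0 to index 1, which is why scripts should pass the registered name rather than the index.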
+
+### D5. `kernbench probe`: latency / bandwidth diagnostic utility

 Required arguments:

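@@ -85,7 +102,7 @@ Probe additionally verifies monotonicity invariants, e.g. local-H
 It is a developer tool for validating the latency / bandwidth model; it is
 not a benchmark.

-### D5. `kernbench web`: topology viewer
+### D6. `kernbench web`: topology viewer

 Optional arguments:

@@ -99,7 +116,7 @@ Probe additionally verifies monotonicity invariants, e.g. local-H
 - `kernbench web` is interactive: pan/zoom, hover for component properties,
   and switching between the SIP / CUBE / PE views.

-### D6. The runtime API and simulation engine stay device-scoped
+### D7. The runtime API and simulation engine stay device-scoped

 - Each runtime API call operates on one device.
 - The simulation engine schedules all requests deterministically.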
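@@ -108,6 +125,9 @@ Probe additionally verifies monotonicity invariants, e.g. local-H
 This invariant keeps each layer independently testable; device enumeration
 and multi-device fan-out exist only in the CLI's `run` command (D3).

+The `probe` implementation lives under `kernbench.probes` (separate from
+`kernbench.benches`). This reflects that probe is a diagnostic utility, not a
+registered bench.
+
 ## Consequences

 - Benchmark authors write single-device logic; multi-device behavior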
@@ -168,6 +168,36 @@ placement = resolve_dp_policy(
 No post-hoc `pe_index` shifting: ShardSpec directly holds the structural
 `(sip, cube, pe)` coordinates. See ADR-0026 for ShardSpec details.

+### D5. SIP grid size: explicit `sips.w/h` resolution
+
+The SIP grid shape (width × height) for the 2D inter-SIP topologies
+(`torus_2d`, `mesh_2d_no_wrap`) is resolved from `system.sips.w` /
+`system.sips.h`, the same way D1 resolves `world_size` from `sips.count`.
+Precedence: explicit `w/h` (validated with `w*h == count`) > square fallback
+(`round(sqrt(count))²`, only when `w/h` is unspecified) > error.
+
+```python
+sips = spec.get("system", {}).get("sips", {})
+if sip_topo == "ring_1d":
+    w, h = 0, 0                          # 1D sentinel (no grid)
+elif sips.get("w") is not None and sips.get("h") is not None:
+    w, h = int(sips["w"]), int(sips["h"])
+    if w * h != n_sips:
+        raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
+else:
+    side = int(round(math.sqrt(n_sips)))
+    if side * side != n_sips:
+        raise ValueError("non-square sips.count requires explicit sips.w/h")
+    w, h = side, side
+```
+
+This removes the previous assumption that a 2D SIP grid must be a perfect
+square: a 6-SIP `torus_2d` / `mesh_2d_no_wrap` can now be expressed as
+`w: 3, h: 2` (or `2x3`). The derived `(w, h)` is passed to the algorithm's
+inter-SIP exchange (consumed in ADR-0032 D5). The previous code path silently
+took `round(sqrt(count))²` for every non-ring topology and produced a wrong
+grid (e.g. 2×2 for 6 SIPs). The explicit `w/h` path with a fail-loud fallback
+replaces it.
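Reviewer note: the precedence rule of D5 can be exercised in isolation. The sketch below wraps the same logic as the snippet in the ADR into a standalone function; the name `resolve_sip_grid` and its flat argument list are assumptions made for illustration, not the signature used in the repository.

```python
import math


def resolve_sip_grid(sip_topo: str, n_sips: int, w=None, h=None) -> tuple[int, int]:
    """Resolve the SIP grid: explicit w/h > square fallback > error."""
    if sip_topo == "ring_1d":
        return 0, 0  # 1D sentinel (no grid)
    if w is not None and h is not None:
        w, h = int(w), int(h)
        if w * h != n_sips:
            raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
        return w, h
    side = int(round(math.sqrt(n_sips)))
    if side * side != n_sips:
        raise ValueError("non-square sips.count requires explicit sips.w/h")
    return side, side


print(resolve_sip_grid("torus_2d", 6, w=3, h=2))   # (3, 2)
print(resolve_sip_grid("mesh_2d_no_wrap", 4))      # (2, 2)
try:
    resolve_sip_grid("torus_2d", 6)                # no w/h, 6 is not a square
except ValueError as exc:
    print("rejected:", exc)
```

The third call shows the fail-loud behavior: six SIPs without `w/h` raise instead of silently becoming a 2×2 grid.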
+
 ---

 ## Dependencies
@@ -31,7 +31,7 @@ pe0-only same-lane inter-cube reduce**, then inter-SIP on the root cube

 ### Current state

-- `src/kernbench/ccl/algorithms/intercube_allreduce.py`: kernel
+- `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py`: kernel
 - `src/kernbench/ccl/sfr_config.py`: `configure_sfr_intercube_multisip`
 - `src/kernbench/runtime_api/distributed.py`: `AhbmCCLBackend` wires this up
   automatically at `init_process_group` time.
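@@ -42,29 +42,46 @@ pe0-only same-lane inter-cube reduce**, then inter-SIP on the root cube

 ## Decision

-### D1. Algorithm structure: 5 phases
+### D1. Algorithm structure: 5 phases (center-root, bidirectional)
+
+The root cube sits at the geometric **center** of the cube mesh:
+
+```
+root_col  = cube_w // 2
+root_row  = cube_h // 2
+root_cube = root_row * cube_w + root_col   # center; 10 on a 4×4 mesh
+```
+
+Each reduce/broadcast phase converges on, or diverges from, this center
+**bidirectionally**, shortening the intra-SIP critical path by up to half
+relative to a corner-root walk (on a 4×4 mesh: 4 reduce hops + 4 broadcast
+hops vs. 6+6 for an SE-corner root, i.e. 8 hops instead of 12).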
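Reviewer note: the hop counts quoted in D1 can be checked with a few lines of arithmetic. The sketch assumes one hop per mesh link and that the two halves of each row/column proceed in parallel, so the critical path per phase is the longer of the two sides; `reduce_hops` and `center_root` are hypothetical helper names.

```python
def center_root(cube_w: int, cube_h: int) -> tuple[int, int, int]:
    """Return (root_cube, root_col, root_row) for the center-root election."""
    root_col, root_row = cube_w // 2, cube_h // 2
    return root_row * cube_w + root_col, root_col, root_row


def reduce_hops(cube_w: int, cube_h: int, root_col: int, root_row: int) -> int:
    """Critical-path hops of the row reduce plus the column reduce."""
    row_phase = max(root_col, cube_w - 1 - root_col)
    col_phase = max(root_row, cube_h - 1 - root_row)
    return row_phase + col_phase


root, col, row = center_root(4, 4)
print(root, reduce_hops(4, 4, col, row))   # 10 4
print(reduce_hops(4, 4, 3, 3))             # SE-corner root: 6
```

Broadcast retraces the same paths outward, so the totals are 4+4 for the center root and 6+6 for the SE corner.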

 For each SIP (launched concurrently via `mp.spawn`):

 ```
-Phase 1: Row reduce W → E (cube mesh, pe0 only):
-    col=0 sends to E → col=1 accumulates, sends to E → ... → col=3 holds the row sum.
+Phase 1: Row reduce converging at col == root_col (cube mesh, pe0 only):
+    the left half (col < root_col) proceeds W→E and the right half
+    (col > root_col) proceeds E→W; the root_col cube merges both sides → holds the row sum.

-Phase 2: Col reduce N → S on the rightmost column (pe0, col = mesh_w-1):
-    row=0 sends to S → row=1 accumulates, sends to S → ... → the root cube (15)
-    holds the full SIP sum.
+Phase 2: Col reduce at col == root_col, converging on row == root_row:
+    the upper part (row < root_row) proceeds N→S and the lower part
+    (row > root_row) proceeds S→N; the root cube merges both sides → holds the full SIP sum.

-Phase 3: Inter-SIP exchange on the root cube (root cube's pe0 only):
+Phase 3: Inter-SIP exchange at cube_id == root_cube (pe0 only):
     Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast,
     selected by sip_topo_kind (sips.topology in topology.yaml).

-Phase 4: Col broadcast S → N on the rightmost column.
+Phase 4: Col broadcast outward from root_row at col == root_col.

-Phase 5: Row broadcast E → W across the cube mesh.
+Phase 5: Row broadcast outward from root_col across the cube mesh.
 ```

 After all phases complete, pe0 of every cube holds the global sum.

+**Single-cube fast path**: when `cube_w == cube_h == 1` (one cube per rank,
+the common TP case), the intra-SIP reduce/broadcast phases are skipped and
+execution goes straight to the Phase 3 inter-SIP exchange.
+
 The kernel is a single function parameterized by `sip_topo_kind ∈ {0, 1, 2}`
 (ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical across
 topologies; only phase 3 branches. The helper functions `_inter_sip_ring`, `_inter_sip_torus_2d`,
@@ -118,21 +135,24 @@ system:
 ```

 - `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
-- `torus_2d`: a sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring over
-  `global_E/W`, then col ring over `global_S/N`.
-- `mesh_2d_no_wrap`: a square mesh without wraparound. Per-dimension chain
+- `torus_2d`: a `w × h` wrapping mesh. Row ring over `global_E/W`, then
+  col ring over `global_S/N`.
+- `mesh_2d_no_wrap`: a `w × h` mesh without wraparound. Per-dimension chain
   reduce + broadcast.

-The 2D variants require `n_sips` to be a perfect square.
+The 2D grid size `(w, h)` comes from `system.sips.w/h` (ADR-0024 D5).
+The square fallback (`round(sqrt(n_sips))²`) applies only when `w/h` is
+omitted, so rectangular grids (e.g. 6 SIPs as `3×2`) are supported through
+explicit `w/h`.

 ### D5. Process-group integration: `AhbmCCLBackend`

 At `init_process_group` time the backend:

 1. Loads `ccl.yaml` + `topology.yaml`.
-2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
-   `system.sips.topology` using the algorithm module's
-   `TOPO_NAME_TO_KIND`.
+2. Derives `sip_topo_kind` from `system.sips.topology` through the algorithm
+   module's `TOPO_NAME_TO_KIND`, and derives `sip_topo_w, sip_topo_h` from
+   `system.sips.w/h` with the square fallback (ADR-0024 D5).
 3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)`:
    one-time SFR wiring that mimics NCCL communicator creation.

@@ -152,17 +172,19 @@ system:

 ```yaml
 defaults:
-  algorithm: intercube_allreduce
+  algorithm: lrab_hierarchical_allreduce
  buffer_kind: tcm
  ...

 algorithms:
-  intercube_allreduce:
-    module: kernbench.ccl.algorithms.intercube_allreduce
+  lrab_hierarchical_allreduce:
+    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
-    root_cube: 15
+    root_cube: 15   # currently unused: the kernel elects the root dynamically
+                    # as the geometric center (see D1). Kept as a placeholder
+                    # for a future explicit-root override / runtime election hook.
 ```

 `topology.yaml`:
@@ -202,13 +224,16 @@ sip:

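 - **Per-PE allreduce** (PE-to-PE reduce within a cube). Out of scope: this
   algorithm's workload is per-cube DP.
-- **Asymmetric SIP topologies** (non-square mesh/torus).
-  `torus_2d` and `mesh_2d_no_wrap` require `n_sips = k²`.
+- **The square-grid fallback requires `n_sips = k²`**: rectangular SIP
+  grids (non-square mesh/torus) are supported, but only when
+  `system.sips.w/h` is given explicitly (ADR-0024 D5). When `w/h` is omitted,
+  2D topologies fall back to a square grid and still require `n_sips = k²`.
 - **Pipelined chunks**: a single tile per cube, no pipelining yet.
-- **Runtime election of the root cube**: the kernel currently uses
-  `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)`, hardcoded to the SE
-  corner. Since SFR wiring covers every cube, runtime election is a pure
-  kernel change when it becomes necessary.
+- **Runtime election of the root cube**: the kernel currently uses the
+  geometric center,
+  `root_cube = (mesh_h // 2) * mesh_w + (mesh_w // 2)`, to minimize the
+  intra-SIP critical path. Since SFR wiring covers every cube, electing a
+  different root at runtime is a pure kernel change when it becomes necessary.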

 ---

@@ -241,15 +266,14 @@ sip:

 | File | Change |
 |---|---|
-| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
+| `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` (new) | kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
 | `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
 | `src/kernbench/ccl/topologies.py` | add `torus_2d`, `mesh_2d_no_wrap` |
 | `src/kernbench/ccl/install.py` | extend `_OPPOSITE_DIR` with the `global_*` pairs |
 | `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + adds sip_rank/topo arguments |
-| `ccl.yaml` | single `intercube_allreduce` entry |
+| `ccl.yaml` | single `lrab_hierarchical_allreduce` entry |
 | `topology.yaml` | add `system.sips.topology` |
 | `benches/ccl_allreduce.py` | row-wise cube-mesh tensor layout |
-| `tests/test_allreduce_multidevice.py` (new) | config-driven ring/torus/mesh |
-| `tests/test_distributed_intercube_allreduce.py` (new) | full `dist.all_reduce` path |
-| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
+| `tests/sccl/` (test package) | config-driven ring/torus/mesh correctness + full `dist.all_reduce` path + latency/buffer-kind sweeps (evaluation harness, ADR-0043) |
+| `tests/test_intercube_sfr_config.py` | SFR wiring verification |
 | Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
@@ -0,0 +1,133 @@
+# ADR-0038: PCIE_EP Component Model
+
+## Status
+
+Accepted (2026-05-20).
+
+A component-level ADR in the same vein as ADR-0035 (M_CPU), ADR-0036 (IO_CPU),
+and ADR-0037 (Forwarding).
+
+## First action (the first thing it does)
+
+It takes one Transaction from `_inbox`, calls `run()` through `_forward_txn`,
+and inside it applies the PCIe protocol processing delay with `env.timeout()`
+for `node.attrs["overhead_ns"]`. From that point on it follows the forwarding
+convention defined by the generic `ComponentBase` worker (if there is a next
+hop, `out_ports[next_hop].put(...)`; otherwise consume `drain_ns` and call
+`txn.done.succeed()`).
+
+In other words, **the first and only job of PCIE_EP is "to express the PCIe
+protocol overhead as time"**; it makes no additional decisions such as
+routing, payload transformation, or MMIO decoding.
+
+## Context
+
+In the topology graph, PCIE_EP acts as the **boundary point between the host
+and the device**. The builder (`topology/builder.py`) creates an IO chiplet
+instance per SIP, places `pcie_ep`, `io_cpu`, and `io_noc` inside it, and then
+lays bidirectional edges between the external host-side cross-SIP switch and
+`pcie_ep`:
+
+- `switch → pcie_ep`: host → device traffic (MemoryWrite, MemoryRead, KernelLaunch).
+- `pcie_ep → switch`: device-side outbound (e.g. cross-SIP IPCQ tokens).
+
+Inside the IOChiplet, bidirectional `pcie_ep ↔ io_noc` edges are laid, and the
+next hop then branches to `io_cpu` or to the cube-side hbm_ctrl path (see the
+ADR-0036 IO_CPU model). The router and resolver already know the contract
+required by SPEC R7, "PCIE_EP is the endpoint for memory operations", so
+helpers such as `find_pcie_ep(sip)` and
+`find_memory_path(pcie_ep, dst_node)` start from PCIE_EP.
+
+The problem is that all of these dependencies live on the
+builder/router/resolver side, while **no ADR specifies PCIE_EP's own internal
+model**. As a result:
+
+- "Which latency does PCIE_EP model?" can only be answered by reading the code.
+- It is asymmetric with the other components (IO_CPU = ADR-0036, M_CPU = ADR-0035).
+- The rationale for deciding whether to refine the PCIe link-layer model later
+  (e.g. TLP credit, retry) is scattered.
+
+This ADR explicitly pins down the current **thin PCIE_EP model** and records
+that it is an intentional simplification (aligned with the ADR-0033 latency
+model simplification policy).
+
+## Decision
+
+### D1. PCIE_EP uses ComponentBase's generic forwarding worker as is
+
+`PcieEpComponent` inherits from `ComponentBase` and does not override
+`_worker`/`_forward_txn`. Every Transaction is therefore processed in this order:
+
+1. `_fan_in` loads the incoming message (or the Transaction reassembled from
+   Flits) into `_inbox`.
+2. `_worker` takes one item from `_inbox` and forks it with
+   `env.process(self._forward_txn(env, txn))` (per-message pipelining).
+3. `_forward_txn` calls, in order, the op_log start hook → the `run()` delay →
+   the op_log end hook.
+4. `run()` is a single line: `yield env.timeout(overhead_ns)`.
+5. If there is a next hop, `out_ports[next_hop].put(txn.advance())`; if not
+   (it arrived as the terminal), consume `drain_ns` and then call
+   `txn.done.succeed()`.
+
+### D2. `overhead_ns` is PCIE_EP's only timing model
+
+Only `node.attrs["overhead_ns"]` is recognized as a latency parameter. The
+code default is `0.0`, and the IOChiplet's `components.pcie_ep.attrs` in
+`topology.yaml` sets the actual value (current topology: `overhead_ns: 5.0`, in ns).
+
+There is no separate BW serialization resource (simpy.Resource), queue depth,
+or retry model. Link-level BW serialization is handled on the wire side:
+inside the IOChiplet the `pcie_ep_to_noc_bw_gbs = 256.0 GB/s` link applies,
+and outside it the system's `io_ep_to_switch` link BW applies (ADR-0015
+port/wire model). The PCIE_EP component itself takes no part in this BW
+accounting.
+
+### D3. PCIE_EP is aware of bidirectional use but does not change behavior by direction
+
+The topology builder lays bidirectional `switch ↔ pcie_ep` and
+`pcie_ep ↔ io_noc` edges. PCIE_EP therefore:
+
+- inbound (host→device): forwards a Transaction arriving from the switch
+  toward io_noc, via the next-hop computation.
+- outbound (device→host): forwards a Transaction arriving from io_noc/io_cpu
+  toward the switch.
+
+Both cases are handled by D1's generic forwarding worker, and the component
+code itself does not distinguish direction (it only follows `txn.next_hop`).
+
+### D4. PCIE_EP is not Flit-aware (legacy reassembly path)
+
+`_FLIT_AWARE` is not set to `True`. `_fan_in` therefore reassembles the Flits
+chunkified upstream into their parent Transaction and puts it into `_inbox`
+(aligned with the ADR-0033 Phase 2c incremental rollout policy).
+
+D4 is to be re-evaluated if PCIE_EP is later extended with a PCIe TLP-level
+credit model.
+
+### D5. PCIE_EP is a **named node** for the routing helpers
+
+`find_pcie_ep(sip, io_id="io0")`, `find_all_pcie_eps()`, and
+`find_memory_path(pcie_ep, dst_node)` in `policy/routing/router.py` treat
+PCIE_EP as the start (or end) of a memory path. The component body provides no
+information to these helpers; the naming convention
+(`sip{S}.{io_id}.pcie_ep`) is guaranteed by the topology builder.
+
+## Alternatives Considered
+
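+### A1. PCIe TLP-level model (credit, retry, MPS splitting)
+
+Rejected. It conflicts with the simplification principle stated in ADR-0033,
+"the current latency model is expressed as abstract overhead + BW
+serialization". Host↔device protocol conformance is intentionally out of
+scope per SPEC §5 "Non-Goals".
+
+### A2. Giving PCIE_EP its own simpy.Resource to limit in-flight requests
+
+Rejected. Host traffic is not a contention bottleneck in current workloads.
+When it becomes necessary, it is introduced through a separate ADR (for
+compatibility, by leaving D1 as is and extending D2).
+
+### A3. Merging PCIE_EP with IO_CPU
+
+Rejected. PCIE_EP is the first protocol boundary node encountered from the
+host side, whereas IO_CPU is the device-side control-plane processing node
+(ADR-0036). Decision costs such as traffic fan-out and command decoding
+accumulate in IO_CPU, and it is meaningful for PCIE_EP to express only the
+link-edge overhead. Merging them would mix the two responsibilities, against
+the spirit of ADR-0007 (runtime API/sim_engine boundary).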
+
+## Consequences
+
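+- PCIE_EP gets an explicit model ADR while having almost zero lines of code:
+  more consistency, lower maintenance cost.
+- If PCIe-level refinement becomes necessary later, a new ADR extending D2/D4
+  is written to supersede this one.
+- D5 makes explicit that router helpers such as `find_memory_path` depend on
+  PCIE_EP as a named node, so the impact of changing the component ID naming
+  convention becomes clear.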
@@ -0,0 +1,194 @@
+# ADR-0039: PE_MMU Component Model: dual role as component + utility
+
+## Status
+
+Accepted (2026-05-20).
+
+The VA model in ADR-0011 (PA/VA/LA address model) only declares that "PE_MMU
+performs VA→PA translation"; this ADR separately pins down **the behavioral
+model of the PE_MMU component itself**.
+
+## First action (the first thing it does)
+
+At construction time it reads `node.attrs["page_size"]` (default `2 MiB`) and
+`node.attrs["tlb_overhead_ns"]` (default `0.0`) and instantiates the internal
+`PeMMU` object (`policy.address.pe_mmu.PeMMU`) exactly once. This object is
+the single owner of the page table, the sub-page region lists, and the TLB
+overhead.
+
+The first runtime action splits into two paths:
+
+- **Component path (inbox consumption)**: `_worker` takes one Transaction from
+  `_inbox`; if its `request` is an `MmuMapMsg`, it calls
+  `self._mmu.map(va, pa, size)` for each entry and then `txn.done.succeed()`.
+  For an `MmuUnmapMsg` it calls `unmap(va, size)`; any other type is handed to
+  the standard `_forward_txn`. That is, **the MMU's first job is "to apply
+  map/unmap commands to the page table"**.
+- **Utility path (direct call)**: engines inside the same PE, such as PE_DMA /
+  PE_GEMM, call `pe_mmu.mmu.translate(va)` directly. No SimPy event is
+  generated on this path; the caller (when overhead_ns > 0) handles
+  `yield env.timeout(mmu.overhead_ns)` in its own process.
+
+## Context
+
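+ADR-0011 defines the three address models PA/VA/LA and agrees only that "the
+VA model = translation through PE_MMU". In the code, however,
+`PeMmuComponent` performs two complementary roles at once:
+
+1. **A component in the topology graph**: it receives `MmuMapMsg` /
+   `MmuUnmapMsg` sideband messages from the cube NoC and updates the page table.
+2. **A PE-local utility object**: PE_DMA / PE_GEMM in the same PE call
+   `translate(va)` directly at zero latency (or with the caller bearing only
+   `overhead_ns`).
+
+Because no ADR covers both roles, the following ambiguities arise:
+
+- "Why is no SimPy event recorded for MMU translation?" (in practice the
+  caller records it)
+- What is the sub-page region model, and why that model? (it is in the code
+  docstring but has no ADR; only a memory note named
+  `project_mmu_subpage_stopgap` refers to it)
+- **From whom** do map/unmap messages arrive, and **by when** must the update
+  take effect (the ordering contract)?
+
+In addition, `PeMMU.map()` has "later append, last-write-wins (reverse search)"
+semantics. This is a **stopgap** added intentionally for DPPolicy sub-page
+sharding scenarios (e.g. 128B payloads × 4KB pages), which a simple
+single-PA page table model cannot express. The ADR needs to pin down that this
+is a simplification that differs from a real HW MMU.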
+
+## Decision
+
+### D1. Stating the dual role: component and utility
+
+`PeMmuComponent` exposes the following two interfaces within a single class:
+
+- Component interface: `_inbox` consumption and the `_worker` loop (handling
+  MMU sideband messages).
+- Utility interface: the `pe_mmu.mmu` attribute exposes the underlying `PeMMU`
+  object; PE_DMA / PE_GEMM hold this object directly and call `translate()`.
+
+The latter is **not a layer skip**: the inside of a PE is a sibling
+relationship within the single "components" layer defined by ADR-0007, and a
+direct call on the PE_MMU object obtained from the same PE prefix is not
+cross-layer. A cross-layer violation applies only when crossing the runtime
+API / sim_engine / components boundaries.
+
+### D2. Latency model: `translate()` is a pure function, overhead is the caller's responsibility
+
+`PeMMU.translate()` is a pure function and does not yield to SimPy. After the
+translation, the caller (a PE engine) raises
+`if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`
+in its own process.
+
+Reason: the PE engine's SimPy process already holds its own record_start /
+record_end (op_log) hooks and can capture timing consistently. If the MMU
+created a separate process, it would split the PE engine's flow in two and
+blur the meaning of op_log/pipeline overlap.
+
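+#### D2.1. Asymmetry in the current implementation: pipeline vs non-pipeline (Known asymmetry)
+
+At the time of writing, the `pe_dma.py` implementation handles the overhead
+differently on its two call paths:
+
+- **non-pipeline (`handle_command`)**: right after `translate()` it raises
+  `if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`.
+- **pipeline (`_do_pipeline_dma`)**: it only calls `translate()` and **omits**
+  the overhead timeout. The function comment says "same logic as non-pipeline
+  path", but the two do not actually match.
+
+In the default topology `tlb_overhead_ns = 0.0`, so the difference does not
+show up in timing directly. In a simulation configured with
+`tlb_overhead_ns > 0`, however, GEMM/Math on the pipeline path measures faster
+than the same non-pipeline workload by the amount of the MMU overhead.
+
+The D2 contract is "**every** caller is responsible for the overhead", and the
+omission on the pipeline path is **an implementation inconsistency, not an
+intended design**. Nothing in ADR-0014 D6 (pipeline self-routing) states that
+this overhead is waived.
+
+Options (a separate Phase 1/2 proposal is required):
+
+- (a) Add `if mmu.overhead_ns > 0: yield env.timeout(...)` to
+  `_do_pipeline_dma` as well, to match the D2 contract. Recommended.
+- (b) Narrow the D2 contract to "applies to the non-pipeline path only" and
+  justify the pipeline-path waiver together with an update to ADR-0014 D6.
+  Not recommended, because it weakens the meaning of the overhead.
+
+This ADR recommends (a) and assumes it is corrected by a separate small change
+before or right after acceptance.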
+
+### D3. Page table structure: sub-page region list (stopgap)
+
+With the structure
+`self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]`,
+a single page can hold several disjoint regions.
+- `map(va, pa, size)`: **appends** regions, one per page when the range
+  crosses pages.
+- `translate(va)`: fetches the region list by VPN, walks it in **reverse**, and
+  takes the first matching region (last-write-wins).
+- `unmap(va, size)`: removes only regions whose extent is **fully contained**
+  in the unmap range. Partial overlaps with misaligned boundaries are left in
+  place, and the mapping caller is responsible for unmapping with the same
+  boundaries it mapped with.
+
+This is stated as a **simulator stopgap** that differs from a real HW MMU and
+supplements the ADR-0011 VA model. It exists to prevent silent misrouting
+caused by last-write-wins overwrites under DPPolicy sub-page sharding (memory
+note: project_mmu_subpage_stopgap).
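Reviewer note: a minimal sketch of the region-list semantics described in D3, for readers who want to trace a concrete case. It is not the repository's `PeMMU`; the class name `SubpageTable`, the `_pieces` helper, and the interpretation of `pa_at_offset_zero` as "the PA that page offset 0 would map to" are assumptions made for illustration.

```python
class PageFault(Exception):
    """Raised when no region covers the requested VA."""


class SubpageTable:
    def __init__(self, page_size: int = 2 * 1024 * 1024):
        self.page_size = page_size
        # vpn -> list of (start_in_page, end_in_page, pa_at_offset_zero)
        self._table: dict[int, list[tuple[int, int, int]]] = {}

    def _pieces(self, va: int, size: int):
        """Yield (vpn, start, end, offset_from_va) for each page touched."""
        cur, end = va, va + size
        while cur < end:
            vpn, start = divmod(cur, self.page_size)
            length = min(self.page_size - start, end - cur)
            yield vpn, start, start + length, cur - va
            cur += length

    def map(self, va: int, pa: int, size: int) -> None:
        for vpn, start, stop, off in self._pieces(va, size):
            base = pa + off - start  # PA corresponding to page offset 0
            self._table.setdefault(vpn, []).append((start, stop, base))

    def translate(self, va: int) -> int:
        vpn, off = divmod(va, self.page_size)
        for start, stop, base in reversed(self._table.get(vpn, [])):
            if start <= off < stop:  # newest matching region wins
                return base + off
        raise PageFault(hex(va))

    def unmap(self, va: int, size: int) -> None:
        for vpn, start, stop, _ in self._pieces(va, size):
            kept = [r for r in self._table.get(vpn, [])
                    if not (start <= r[0] and r[1] <= stop)]
            if kept:
                self._table[vpn] = kept
            else:
                self._table.pop(vpn, None)


mmu = SubpageTable(page_size=4096)
mmu.map(0x1000, 0x8000, 128)       # first 128B shard of the page
mmu.map(0x1080, 0x9000, 128)       # second shard, different PA
mmu.map(0x1000, 0xA000, 64)        # later overlapping map wins for [0, 64)
print(hex(mmu.translate(0x1004)))  # 0xa004
print(hex(mmu.translate(0x1040)))  # 0x8040
print(hex(mmu.translate(0x1084)))  # 0x9004
```

Unmapping `(0x1000, 64)` afterwards removes only the 64-byte region, so `0x1004` resolves to `0x8004` again; unmapping `(0x1000, 64)` does not touch the 128-byte region because that region is not fully contained in the unmap range.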
+
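+### D4. PageFault is a PA fallback signal
+
+When `translate()` is called with a VA that has no mapping, `PageFault` is
+raised. PE_DMA catches this exception and **uses the original address as a PA
+unchanged** (the PA fallback compatibility path of ADR-0011). PageFault is
+therefore not an error but a signal meaning "interpret as PA when no VA
+mapping exists".
+
+This compatibility path is intended behavior that preserves backward
+compatibility with the PA-only mode agreed in ADR-0011.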
+
+### D5. Receive contract for MMU sideband messages
+
+`MmuMapMsg` / `MmuUnmapMsg` reach the PE_MMU component's `_inbox` through the
+fabric (as R10 states, "MMU map installation follows fabric latency"). The
+message schema is defined by the runtime API (`runtime_api/kernel.py`); the
+current format is:
+
+- `MmuMapMsg.entries: tuple[dict, ...]`: each dict has the keys
+  `{"va": int, "pa": int, "size": int}`.
+- `MmuUnmapMsg.entries: tuple[dict, ...]`: each dict has the keys
+  `{"va": int, "size": int}`.
+
+Receive handling on the PE_MMU side:
+
+1. `_worker` takes one message from `_inbox.get()`.
+2. It checks with `hasattr(msg, "request")` whether it is a Transaction wrapper.
+3. If `isinstance(msg.request, MmuMapMsg)`, for each entry it calls
+   `self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"])`.
+4. If `isinstance(msg.request, MmuUnmapMsg)`, for each entry it calls
+   `self._mmu.unmap(va=e["va"], size=e["size"])`.
+5. In both cases it signals completion with `msg.done.succeed()`.
+
+When the external caller (on the runtime API side) awaits `done`, "the moment
+the mapping is installed on the device" is guaranteed in SimPy time. This wait
+realizes the ADR-0011 requirement that "MMU map installation incurs measured
+fabric latency".
+
+This ADR does not define the **sender and fan-out policy** of the sideband
+messages; that is the runtime API's responsibility. This ADR specifies only
+the receive contract on the PE_MMU side.
+
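+### D6. Non-MMU Transactions are delegated to generic forwarding
+
+When the `request` of a message that `_worker` takes from the inbox is neither
+`MmuMapMsg` nor `MmuUnmapMsg` (or the message has no `request` attribute), it
+is handed to `_forward_txn`. This avoids ruling out the possibility that
+PE_MMU is used as a pass-through node on the cube-internal NOC in the future
+(there is no such pass-through traffic today, but this keeps it safe against
+topology changes).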
+
+## Alternatives Considered
+
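+### A1. Making translate() a SimPy generator
+
+Rejected. As explained in D2, it blurs the meaning of the PE engine's
+op_log/pipeline overlap. The current pattern, where the caller raises the
+timeout, matches the op_log accounting.
+
+### A2. Shrinking the page size itself (e.g. 128B) instead of a sub-page region list
+
+Rejected. It causes the page table memory and the cube-wide map message size
+to explode. Even if DPPolicy sharding requires 128B, most other mappings are
+in 2MiB units, so a small page size inflates the average cost.
+
+### A3. Keeping PE_MMU only as a built-in helper of PE_CPU, not as a component
+
+Rejected. Expressing what ADR-0011 requires, "MMU map installation with
+latency measured through the fabric" (the MmuMapMsg path), requires a node in
+the topology graph. In addition, PE_MMU must appear as a node in the cube NoC
+visualizer for debugging and diagnosis to stay consistent.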
+
+## Consequences
+
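+- The dual role of PE_MMU (component + utility) is justified at the ADR level,
+  which provides an argument against future refactoring pressure (to unify
+  into one of the two).
+- The ADR states that the sub-page region model is a simulator stopgap; this
+  becomes the baseline for evaluating whether the stopgap can be removed once
+  the LA model (ADR-0011) is introduced.
+- The contract that `translate()` does not yield is fixed by the ADR, so a
+  future proposal to "put a timeout inside the MMU itself" can be declined on
+  the basis of D2.
+- PA fallback (D4) is stated to be a normal flow, which prevents defensive
+  logic from being added after mistaking PageFault for an error.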
@@ -0,0 +1,142 @@
+# ADR-0040: PE_TCM Component Model: dual-channel BW serialization
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0014 (PE Pipeline Execution Model) mentions that "PE_TCM is a BW-based
+serialized scratchpad memory" (D1); this ADR separately specifies the exact
+behavioral model of the TCM component itself.
+
+## First action (the first thing it does)
+
+As soon as `start()` is called it creates two `simpy.Resource(env, capacity=1)`
+objects and keeps them in `self._read_res` / `self._write_res`. These two
+resources are the single decision point that serializes the **read channel**
+and the **write channel** to one in-flight request each.
+
+First runtime action: `_worker` takes one message from `_inbox` and branches
+on its type:
+
+- `TcmRequest` (coming from `pe_fetch_store`): forks with
+  `env.process(self._handle_tcm_request)`. That is, **the TCM's first job is
+  "to take the channel lock that matches the direction (read/write)"**. After
+  acquiring the lock, if `bw > 0 and nbytes > 0` it waits
+  `delay_ns = nbytes / bw` with `env.timeout`, and then calls
+  `req.done.succeed()`.
+- Anything else (Transaction): forks with `env.process(self._forward_txn)`
+  (the legacy fabric pass-through path).
+
+At construction time it reads `node.attrs["read_bw_gbs"]` /
+`node.attrs["write_bw_gbs"]` (default `512.0 GB/s` each) and keeps them.
+
+## Context
+
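+In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives the following two kinds
+of traffic:
+
+1. **`TcmRequest` from PE_FETCH_STORE to PE_TCM**: for a TCM ↔ Register File
+   transfer, PE_FETCH_STORE sends a short sideband request to obtain the
+   access latency serialized by the TCM's BW (`direction = "read"` or
+   `"write"`, `nbytes`, and a `done` event).
+2. **legacy Transaction forwarding**: a generic forwarding path kept in case
+   the TCM is picked as a pass-through node in the topology graph (not used on
+   the current critical path, but preserved).
+
+Problem: ADR-0014 only mentions that "PE_TCM is BW-based serialization". The
+code, however, explicitly has the following:
+
+- **Reads and writes are separate channels and can proceed concurrently**,
+  while requests in the same direction are serialized with cap=1.
+- BW can be configured separately through the two values `read_bw_gbs` /
+  `write_bw_gbs`.
+- The formula `delay_ns = nbytes / bw_gbs` (unit shorthand: GB/s × ns ≈ B).
+- When nbytes==0 the BW term is skipped, but the channel lock is still taken.
+- `run()` yields for `overhead_ns` (default 0.0), but this is used only on the
+  legacy fabric path (Transaction forwarding).
+
+All of this needs to be pinned down in a separate ADR. In particular, "why are
+read and write separate channels" and "who decides the BW" are items that need
+a clear rationale when someone later tries to change them, for example to
+capacity=2.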
+
+## Decision
+
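+### D1. Dual channel: read and write are independent resources
+
+`_read_res = simpy.Resource(env, capacity=1)`,
+`_write_res = simpy.Resource(env, capacity=1)`.
+Concurrent requests in the same direction are serialized in the resource
+queue, while requests in different directions can proceed at the same time.
+This matches a model in which the real HW operates the TCM as dual-ported
+(read port + write port), and the separation is intended to express, within
+the BW-serialization model, the normal case where fetch (read) and store
+(write) overlap in time in the GEMM pipeline.
+
+### D2. BW model of a single channel: `nbytes / bw_gbs`
+
+After acquiring the channel lock, if `nbytes > 0 and bw > 0`, it executes
+`yield env.timeout(nbytes / bw_gbs)`. The unit shorthand is GB/s × ns ≈ B, the
+same BW formula used throughout the simulator (see ADR-0033: the simulator
+uses a consistent shorthand unit).
+
+- `nbytes == 0`: the BW term is 0, but the lock is taken and released
+  immediately. This case is intended: even when a plan generator that emits an
+  empty fetch/store sends a request from the PE_FETCH_STORE side with only
+  `nbytes` set to 0, the TCM-side op_log / channel accounting is consumed once,
+  consistently.
+- `bw == 0` (a config mistake): the timeout call itself is skipped, so it is a
+  0-time pass. It does not occur in a normal setup.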
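Reviewer note: the dual-channel accounting of D1/D2 can be traced without SimPy. The sketch below replaces each `simpy.Resource(capacity=1)` with a "channel free at" timestamp; `TcmChannels`, `access`, and the `ValueError` on an unknown direction are hypothetical and only approximate what the ADR describes.

```python
class TcmChannels:
    """Toy model: one serialized read channel and one serialized write channel."""

    def __init__(self, read_bw_gbs: float = 512.0, write_bw_gbs: float = 512.0):
        self._bw = {"read": read_bw_gbs, "write": write_bw_gbs}
        self._free_at = {"read": 0.0, "write": 0.0}  # ns when each channel frees up

    def access(self, now_ns: float, direction: str, nbytes: int) -> float:
        """Return the completion time (ns) of a request issued at now_ns."""
        if direction not in self._bw:
            raise ValueError(f"unknown direction: {direction!r}")
        bw = self._bw[direction]
        start = max(now_ns, self._free_at[direction])  # wait for the channel lock
        # GB/s x ns ~= bytes, so nbytes / bw_gbs is already in ns.
        delay = nbytes / bw if (bw > 0 and nbytes > 0) else 0.0
        done = start + delay
        self._free_at[direction] = done
        return done


tcm = TcmChannels()
print(tcm.access(0.0, "read", 4096))   # 8.0  (4096 B / 512 GB/s)
print(tcm.access(0.0, "read", 4096))   # 16.0 (queued behind the first read)
print(tcm.access(0.0, "write", 4096))  # 8.0  (write channel is independent)
```

The second read finishes at 16 ns because it shares the read channel, while the write issued at the same instant finishes at 8 ns, which is the fetch/store overlap that D1 is meant to preserve.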
+
+### D3. BW is configured through `read_bw_gbs` / `write_bw_gbs` in `node.attrs`
+
+The default is `512.0 GB/s`. The topology builder (`topology/builder.py`)
+passes these attrs when it instantiates the TCM from `pe_template`. A change
+to the default must go together with a decision on the ADR-0014 D1 or
+ADR-0033 latency model side.
+
+### D4. PE_TCM owns the TcmRequest schema
+
+`@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
+is defined in `components/builtin/pe_tcm.py`. PE_FETCH_STORE only imports this
+dataclass to create and send it. The caller side does not define the schema
+for these reasons:
+
+- The meaning of BW serialization is the TCM's responsibility: the TCM decides
+  which fields are used for the serialization decision.
+- Validation that narrows the `direction` string to the two values `"read"` /
+  `"write"` is also handled on the TCM side (the if/else branch in
+  `_handle_tcm_request`).
+
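+### D5. Preserving the legacy Transaction forwarding path
+
+When `_worker` receives a message that is not a `TcmRequest`, it hands it to
+`_forward_txn`. This is where the `overhead_ns` of `run()` applies. In the current
+standard PE pipeline the TCM is not a pass-through node for any Transaction, but the path
+is kept in case the fabric topology changes later (orthogonal to the usage pattern of D1).
+
+On the op_log side this path is recorded under ordinary Transaction accounting, and it
+does not take the BW channel lock.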
+
+### D6. PE_TCM is not a data store (timing only)
+
+The TCM models **time only**. The actual data payload is held by sim_engine's separate
+`memory_store` (if present), and the TCM component does not update it. PE_FETCH_STORE
+likewise obtains only the BW delay through a TcmRequest and handles the actual register
+contents on a separate path (ADR-0020 2-pass data execution model, where data is
+processed in Phase 2).
+
+## Alternatives Considered
+
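+### A1. Single channel (read and write sharing one resource, capacity=2)
+
+Rejected. It would artificially serialize the normal case where fetch (read) and store
+(write) overlap in time, mis-modeling the BW upper bound of the PE pipeline.
+
+### A2. Channel capacity > 1 (e.g., a 2-banked TCM)
+
+Rejected. The current HW model assumes a single bank. Extending to multiple banks
+requires a separate ADR that supersedes D1. Raising capacity now would leave the BW upper
+bound unchanged while loosening only the nominal serialization, which lowers model
+accuracy.
+
+### A3. Generalize the BW formula to `nbytes / bw + overhead_ns`
+
+Rejected. `overhead_ns` is used only on the legacy forwarding path of D5. If the
+fetch/store critical path ever needs extra overhead, it belongs in PE_FETCH_STORE's
+`run()` or in the register-file access model rather than in the TCM, which better fits
+the responsibility boundary.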
+
+## Consequences
+
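+- TCM BW accounting is fixed at the ADR level, so when reading the op_log of a GEMM/Math
+  sweep, questions such as "why did fetch and store proceed concurrently?" or "why is
+  only the same direction serialized?" are answered quickly by D1.
+- For a future multi-bank TCM or an asymmetric read/write BW model, the scope of impact
+  is clear (which of D1, D2, D3 is being modified).
+- Stating that the TCM is not a data store (D6) firms up the responsibility boundary
+  with ADR-0020 2-pass execution.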
@@ -0,0 +1,187 @@
+# ADR-0041: Cube SRAM Component Model — terminal scratchpad on cube NoC
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0017 (Cube NOC and HBM Connectivity) mentions SRAM only as an attachment on the cube
+NoC. This ADR fills that gap by specifying the latency/response model of the SRAM
+component itself.
+
+## First action
+
+Right after `_worker` takes one Transaction from `_inbox`, the first thing it does is
+call `yield from self.run(env, txn.nbytes)`, which issues an `env.timeout()` of
+`node.attrs["overhead_ns"]` (default `0.0`).
+
+In other words, **the SRAM's first job is to express access overhead as time**.
+After the overhead is consumed, it yields `drain_ns` (the terminal BW serialization cost
+attached to that Transaction), then builds a `ResponseMsg` and sends it along the
+reverse path.
+
+This differs from the generic `ComponentBase._worker`: the SRAM knows it is a
+**terminal node**, so it bypasses `_forward_txn` and its own worker spells out the
+`run → drain → _send_response` order.
+
+## Context
+
+The cube topology (`topology/builder.py`) creates the following named nodes per cube:
+
+- `sip{S}.cube{C}.m_cpu`
+- `sip{S}.cube{C}.sram`
+- `sip{S}.cube{C}.hbm_ctrl` (one partition per PE)
+- `sip{S}.cube{C}.pe{P}` (sub-components inside the PE)
+
+The SRAM is one of the attachments on the cube NoC and is attached to the nearest router
+(`topology/mesh_gen.py` determines the nearest router from the placement coordinates and
+adds the SRAM to its `attach`). The builder lays a bidirectional `sram ↔ router` edge
+(BW: `sram_to_router_bw_gbs`, default `128.0 GB/s`).
+
+The SRAM has two core roles:
+
+1. **Fabric terminal**: the endpoint of memory-access Transactions directed at the SRAM
+   on the cube NoC. The SRAM consumes the access overhead and drain, then returns a
+   response along the reverse path.
+2. **One of the IPCQ slot tiers**: of the `buffer_kind ∈ {tcm, sram, hbm}` defined by
+   ADR-0023 D9.7, the slot bw/overhead of the `sram` tier is looked up in
+   `common/ipcq_types._BUFFER_KIND_BW`; the current value is `(512.0 GB/s, 2.0 ns)`.
+   This value is separate from `overhead_ns` in the SRAM node attrs, and PE_DMA converts
+   it into time at IPCQ slot accounting.
+
+Both roles are fulfilled by a single SRAM component, and without a dedicated ADR:
+
+- "Which latency does the SRAM model?" (fabric drain + overhead, or the slot latency of
+  the IPCQ tier?) has no single answer.
+- It is unclear what the SRAM size attr (`size_mb`) will actually mean in the future.
+  The current code does not use the size and models timing only.
+- The rationale for which router the SRAM attaches to (placement-based) lives only
+  inside the topology code.
+
+## Decision
+
+### D1. The SRAM is a terminal scratchpad node on the cube NoC
+
+`SramComponent` inherits from `ComponentBase` but overrides `_worker` to express
+terminal semantics directly:
+
+```
+while True:
+    txn = yield self._inbox.get()
+    yield from self.run(env, txn.nbytes)     # overhead_ns
+    if drain_ns > 0: yield env.timeout(drain_ns)
+    yield from self._send_response(env, txn)
+```
+
+Because the SRAM must know the reverse path, this pattern needs its own worker rather
+than the generic `_forward_txn` (which forwards to the next hop).
+
+#### D1.1. Currently unused: the `_worker` override is a dormant path
+
+In the codebase as of this writing, **no component actually sends a Transaction to the
+SRAM node**. The confirmed places that reference the SRAM node ID are:
+
+- Routing helpers such as `policy/routing/router.py`: they only guarantee that a path
+  can be looked up.
+- `components/builtin/pe_dma.py::_handle_ipcq_inbound`: when an IPCQ slot has
+  `buffer_kind == "sram"`, it looks up only the *path* to
+  `bank_node = f"{cube_prefix}.sram"`, converts it with `compute_drain_ns(path, ...)`,
+  and **times out locally**. The Transaction itself never flows to the SRAM node (see D4).
+- `tests/test_routing.py`: verifies connectivity only, via
+  `find_path("sip0.cube0.pe0", "sip0.cube0.sram")`.
+
+The `_worker`/`_send_response` override is therefore a **dormant code path**.
+It is kept rather than deleted because:
+
+- It is immediately usable once a topology change makes the SRAM the endpoint of real
+  fabric Transactions (e.g., explicit M_CPU → SRAM access).
+- Endpoint behavior follows naturally from the cube-attached scratchpad semantics
+  defined by ADR-0017 (Cube NOC), so this is an intentional placeholder.
+
+When this dormant state ends will be specified by a separate ADR (or a later revision of
+this ADR).
+
+### D2. Building the ResponseMsg and sending it on the reverse path
+
+`_send_response` does the following:
+
+1. Computes the reverse route with `reverse_path = list(reversed(txn.path))`.
+2. Builds `ResponseMsg(correlation_id=txn.request.correlation_id, request_id=...,
+   src_cube=<this cube>, src_pe=-1, success=True)`.
+3. Wraps it in `Transaction(request=resp_msg, path=reverse_path, step=0, nbytes=0,
+   done=env.event(), is_response=True)` and puts it to `out_ports[reverse_path[1]]`.
+4. If the reverse path is abnormal (`< 2 hops`) or there is no ctx, falls back to
+   calling only the original `txn.done.succeed()`.
+
+`src_pe = -1` means "the SRAM is not PE-localized". `src_cube` is filled by parsing the
+cube index out of the node ID (`sip{S}.cube{C}.sram`).
+
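The route handling in D2 can be sketched without SimPy. This is an illustration only: the helper names `parse_cube_index` and `plan_response`, the returned dict, and the router node ID in the example are hypothetical, and "`< 2 hops`" is interpreted here as a reversed path with fewer than two entries.

```python
import re
from typing import Optional


def parse_cube_index(node_id: str) -> int:
    """Extract C from a node ID shaped like 'sip{S}.cube{C}.sram'."""
    match = re.search(r"\.cube(\d+)\.", node_id)
    if match is None:
        raise ValueError(f"no cube index in node id: {node_id!r}")
    return int(match.group(1))


def plan_response(path: list) -> Optional[dict]:
    """Return the response routing facts, or None when the fallback applies."""
    reverse_path = list(reversed(path))
    if len(reverse_path) < 2:
        # Abnormal route: the caller would only succeed the original txn.done.
        return None
    return {
        "reverse_path": reverse_path,
        "next_hop": reverse_path[1],          # key used to pick the out port
        "src_cube": parse_cube_index(reverse_path[0]),
        "src_pe": -1,                         # SRAM is not PE-localized
    }


# Hypothetical forward path ending at the SRAM node.
forward = ["sip0.cube3.pe0", "sip0.cube3.router1", "sip0.cube3.sram"]
plan = plan_response(forward)
print(plan["next_hop"], plan["src_cube"], plan["src_pe"])
print(plan_response(["sip0.cube3.sram"]))
```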
+### D3. Timing parameters are split into `overhead_ns` and wire-side `drain_ns`
+
+- **Component-side latency**: `node.attrs["overhead_ns"]`. The default topology sets it
+  to about `2.0 ns`.
+- **Link-side serialization**: `drain_ns` is the value the Transaction carries on
+  arrival, the result of the wire-side BW serialization of ADR-0015 (port/wire model).
+  The SRAM simply yields it.
+- The `size_mb` attr (default `32 MiB`) is not currently used for timing. If a
+  capacity-aware model is introduced later, its meaning will be defined then (in a
+  separate ADR).
+
+### D4. The SRAM component does not model IPCQ slot accounting
+
+The SRAM-tier write latency of an IPCQ slot under ADR-0023 D9.7 is consumed by PE_DMA's
+`_handle_ipcq_inbound`, which calls `slot_io_latency_ns("sram", nbytes)` directly (that
+function uses the value of `common/ipcq_types._BUFFER_KIND_BW["sram"]`). That is:
+
+- When the SRAM component receives and handles a fabric Transaction, only **D1, D2, D3**
+  apply.
+- When an IPCQ slot lives in SRAM, PE_DMA pays the time separately at IPCQ slot-write
+  time. This is unrelated to the SRAM component code and is IPCQ-side accounting.
+
+This separation is intentional: IPCQ is a fast path (sub-cycle slot bookkeeping) that
+does not go through fabric Transactions, so the SRAM has no need to know about IPCQ.
+
+### D5. The SRAM's attachment point on the cube NoC is placement-driven
+
+`topology/mesh_gen.py` reads `placement.sram.pos_mm` (default `[1.5, 9.0]` in
+`topology.yaml`) and adds `"sram"` to the `attach` of the nearest router. The builder
+(the attachment loop in `topology/builder.py`) reads that attach information and lays a
+bidirectional edge between the `sram` node and the router.
+
+This decision lives outside the SRAM component code (in mesh_gen / builder), and the
+component does not need to know which router it is attached to. It only needs to know
+that `txn.path` / `reverse_path` reaches it through a router.
+
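A placement-driven attachment like the one in D5 can be sketched as a nearest-point lookup. The distance metric is an assumption (Euclidean distance in millimetres; the source only says "nearest"), and the router names and coordinates below are made up for illustration rather than taken from `topology.yaml`.

```python
import math


def nearest_router(pos_mm, router_pos_mm):
    """Return the name of the router closest to pos_mm (Euclidean, assumed)."""
    return min(router_pos_mm, key=lambda name: math.dist(pos_mm, router_pos_mm[name]))


# Hypothetical router coordinates in mm.
routers = {"r0": (0.0, 0.0), "r1": (0.0, 8.0), "r2": (4.0, 8.0)}
attach = {name: [] for name in routers}

sram_pos = (1.5, 9.0)                  # default placement.sram.pos_mm from D5
chosen = nearest_router(sram_pos, routers)
attach[chosen].append("sram")          # the builder later adds sram <-> router edges
print(chosen, attach)
```

Moving `sram_pos` changes only which `attach` list receives `"sram"`, which is why the component itself never needs to know its router.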
+### D6. The SRAM is not a data store (timing-only)
+
+Same reasoning as ADR-0040 D6: the SRAM component models time only, and the actual data
+payload is held by sim_engine's `memory_store` (when present).
+
+## Alternatives Considered
+
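+### A1. Have the SRAM use `_forward_txn` as-is with a separate response node, like IO_CPU / HBM_CTRL
+
+Rejected. The SRAM is a terminal on the cube NoC; a separate node to receive the
+response would add a meaningless hop and go against the cube-NoC simplification of
+ADR-0017.
+
+### A2. Have the SRAM model BW serialization with its own resource
+
+Rejected. Link-side BW serialization (`drain_ns`) already captures the semantics. Adding
+another `simpy.Resource` inside the component would double-count against the ADR-0015
+wire-side model.
+
+### A3. Handle IPCQ slot accounting inside the SRAM component
+
+Rejected. As D4 states, IPCQ is a fast path that does not go through fabric
+Transactions. Making the SRAM IPCQ-aware would split responsibility in two and make
+reasoning harder.
+
+### A4. A capacity-aware latency model driven by `size_mb`
+
+Rejected (for now). Capacity is used only for labeling in the topology visualizer, and
+its effect on timing is not yet modeled. If needed, it will be introduced in a separate
+ADR.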
+
+## Consequences
+
+- The SRAM timing model is fixed at the ADR level as
+  `overhead_ns + drain_ns + ResponseMsg(reverse_path)`, so a proposal to add IPCQ slot
+  latency to the SRAM component can be declined on the basis of D4.
+- `size_mb` is stated to be timing-neutral today (D3), which narrows the compatibility
+  impact of a future capacity-aware model.
+- Placement-driven router attachment (D5) is stated explicitly, so it is clear which
+  part is affected when the SRAM coordinates move (`mesh_gen` only).
@@ -0,0 +1,194 @@
+# ADR-0042: Tile Plan Generators (plan builders for the GEMM/Math pipelines)
+
+## Status
+
+Accepted (2026-05-20).
+
+This ADR states that `tiling.py` is a **plan-generator module**, not a SimPy
+component.
+
+D6 (tile plan / self-routing) of ADR-0014 (PE Pipeline Execution Model) does not define
+the tile-plan generation algorithm itself, so this ADR fills that gap.
+
+## First action
+
+When `generate_gemm_plan(M, K, N, tile_m, tile_k, tile_n, ..., pe_prefix, a_pinned,
+b_pinned, epilogue_specs)` is called, the first thing it does is **compute the tile
+counts and build the component ID strings**:
+
+```
+M_tiles = max(1, ceil(M / tile_m))
+K_tiles = max(1, ceil(K / tile_k))
+N_tiles = max(1, ceil(N / tile_n))
+dma_id   = f"{pe_prefix}.pe_dma"
+fetch_id = f"{pe_prefix}.pe_fetch_store"
+gemm_id  = f"{pe_prefix}.pe_gemm"
+math_id  = f"{pe_prefix}.pe_math"
+```
+
+In other words, **the plan generator's first job is to compute the tile counts with a
+ceiling and build this PE's four sub-component IDs in one go**. It touches no SimPy
+events or environment objects at all: the module consists of pure functions.
+
+`generate_math_plan(M, N, tile_m, tile_n, ..., math_op, src_addr, dst_addr,
+pe_prefix)` likewise starts by computing `M_tiles` and `N_tiles` and building three
+component IDs (`dma_id`, `fetch_id`, `math_id`).
+
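The first step described above is plain arithmetic and string formatting, so it can be reproduced in isolation. The helper names `gemm_tile_counts` and `component_ids` are hypothetical; only the formulas and ID suffixes come from the text. The tile sizes 32/64/32 are the `TILE_M/K/N` values quoted in ADR-0044 D4.

```python
import math


def gemm_tile_counts(M, K, N, tile_m, tile_k, tile_n):
    """Ceiling-based tile counts, clamped to at least one tile per dimension."""
    return (
        max(1, math.ceil(M / tile_m)),
        max(1, math.ceil(K / tile_k)),
        max(1, math.ceil(N / tile_n)),
    )


def component_ids(pe_prefix):
    """The four sub-component IDs a GEMM plan refers to."""
    return {
        "dma_id": f"{pe_prefix}.pe_dma",
        "fetch_id": f"{pe_prefix}.pe_fetch_store",
        "gemm_id": f"{pe_prefix}.pe_gemm",
        "math_id": f"{pe_prefix}.pe_math",
    }


m_tiles, k_tiles, n_tiles = gemm_tile_counts(512, 512, 512, 32, 64, 32)
print(m_tiles, k_tiles, n_tiles, m_tiles * k_tiles * n_tiles)
print(gemm_tile_counts(100, 64, 33, 32, 64, 32))   # partial tiles round up
print(component_ids("sip0.cube0.pe0")["fetch_id"])
```

For the 512-cubed shape the product of the three counts is 2048, which matches the tile count ADR-0044 cites for that shape.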
+## Context
+
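+ADR-0014 D6 agreed only that "when PE_SCHEDULER receives a CompositeCmd, it generates a
+TilePlan and feeds self-routing tile tokens". In the code, however, **the concrete
+plan-generation algorithm** lives in the module
+`src/kernbench/components/builtin/tiling.py`, and this module:
+
+- is a collection of **pure functions** (`generate_gemm_plan`, `generate_math_plan`),
+  not a component.
+- does not depend on the SimPy environment, queues, op_log, or hooks.
+- returns a `PipelinePlan` (dataclass) as its result.
+
+The earlier G4 analysis wrongly assumed `tiling.py` was a component; in reality it is a
+set of plan-builder functions injected into PE_SCHEDULER. This distinction must be
+pinned down in a separate ADR paired with ADR-0014 D6. Otherwise:
+
+- It is ambiguous whether building the tile plan is the responsibility of PE_SCHEDULER
+  or of a separate module.
+- The rationale for whether the stage sequences of the GEMM plan and the Math plan are
+  consistent (e.g., where FETCH/STORE sit) is scattered.
+- There is no stated rationale for why options such as `a_pinned` / `b_pinned` /
+  `epilogue_specs` branch at plan level.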
+
+## Decision
+
+### D1. tiling is a pure plan-generator module, not a component
+
+`components/builtin/tiling.py` defines no ComponentBase subclass.
+It exposes only two module-level functions:
+
+- `generate_gemm_plan(...) -> PipelinePlan`
+- `generate_math_plan(...) -> PipelinePlan`
+
+The topology graph contains no node named `tiling`. The module sits in the `builtin/`
+directory because it is a direct helper of PE_SCHEDULER (ADR-0014 D6); semantically it
+is closer to an internal utility of PE_SCHEDULER.
+
+### D2. Stage sequence of the GEMM plan: `M → N → K` order
+
+Stage sequence for each (m, n, k) tile (baseline with no operand pinning and no epilogue):
+
+```
+[DMA_READ(A)] → [DMA_READ(B)] → FETCH → GEMM
+                                ↑
+                                ↓
+(last k tile only)              [MATH(output_tile)]* → STORE → DMA_WRITE
+```
+
+A `k_tile` epilogue runs right after GEMM on every K tile; an `output_tile` epilogue
+runs once per (m, n), on the last K tile, right before STORE/DMA_WRITE. The K-loop
+accumulator stays in the RegFile, so no STORE/DMA_WRITE occurs between K tiles (output
+is emitted only at last_k).
+
+### D3. Operand pinning: `a_pinned` / `b_pinned`
+
+When the caller passes `a_pinned=True`, **the A DMA_READ is omitted for every (m, n, k)
+tile**. Meaning: it signals to the plan generator that the caller (e.g., `tl.composite`)
+has already loaded all of A into the TCM once with `tl.load`.
+
+This branch is decided at plan time (it is not a runtime branch). The number of stage
+records in the op_log therefore varies deterministically with pinning, and sweep
+analysis (e.g., the stage record count of gemm_sweep) sees this decision directly.
+
+### D4. Epilogue scope: `k_tile` vs `output_tile`
+
+`epilogue_specs` is an iterable of op-spec objects. Each op object is assumed to have
+the following attributes:
+
+- `op.kind: str`: the math op name (e.g., `"dequant"`, `"bias"`, `"relu"`, `"scale"`).
+  It goes into the stage's `params["op_kind"]`.
+- `op.scope: Scope`: `Scope.K_TILE` or `Scope.OUTPUT_TILE` (`Scope` is an enum defined
+  in `kernbench.common.pe_commands`).
+- Additional per-op fields (e.g., `bias`, `scale`, `factor`): not currently used by the
+  plan generator; consumed by the runtime (PE_MATH).
+
+The plan generator splits ops into two groups based on `getattr(o, "scope", None)`:
+
+- `scope == Scope.K_TILE`: add a MATH stage right after GEMM on every K tile.
+- `scope == Scope.OUTPUT_TILE`: add a MATH stage right before STORE on the last K tile
+  of each (m, n).
+
+An op with no `scope` attribute, or whose scope is neither enum value, is **not included
+in the plan** (both `getattr(..., None) == Scope.X` comparisons are False). Applying the
+default (`output_tile`) is **the caller's responsibility** (e.g., `tl.composite`); the
+plan generator only branches on the scope value that is already filled in (aligned with
+the composite epilogue contract of ADR-0014).
+
+`Scope` is imported lazily inside the function to avoid the circular reference
+`pe_commands ← pe_types ← tiling`. This is an intentional pattern, not something to fix
+(from D1's view that tiling is a PE_SCHEDULER utility, having no compile-time dependency
+on pe_commands keeps the module boundary clean).
+
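The rules of D2 through D4 can be combined into one small sketch that lists the stage kinds for a single (m, n) output tile across its K loop. This is not `generate_gemm_plan`: the function `gemm_stage_kinds`, the string stage labels, and the local `Scope` enum (a stand-in for the one in `kernbench.common.pe_commands`) are hypothetical. Treating `b_pinned` as skipping the B DMA_READ is an assumption by analogy with the `a_pinned` rule.

```python
from enum import Enum
from types import SimpleNamespace


class Scope(Enum):
    """Local stand-in for kernbench.common.pe_commands.Scope."""
    K_TILE = "k_tile"
    OUTPUT_TILE = "output_tile"


def gemm_stage_kinds(k_tiles, a_pinned=False, b_pinned=False, epilogue_specs=()):
    """Stage labels for one (m, n) tile over its K loop (illustrative only)."""
    k_ops = [o for o in epilogue_specs if getattr(o, "scope", None) == Scope.K_TILE]
    out_ops = [o for o in epilogue_specs if getattr(o, "scope", None) == Scope.OUTPUT_TILE]
    stages = []
    for k in range(k_tiles):
        if not a_pinned:
            stages.append("DMA_READ(A)")
        if not b_pinned:                      # assumed symmetric to a_pinned
            stages.append("DMA_READ(B)")
        stages += ["FETCH", "GEMM"]
        stages += [f"MATH({o.kind})" for o in k_ops]        # after every GEMM
        if k == k_tiles - 1:                                 # last K tile only
            stages += [f"MATH({o.kind})" for o in out_ops]
            stages += ["STORE", "DMA_WRITE"]
    return stages


ops = [
    SimpleNamespace(kind="dequant", scope=Scope.K_TILE),
    SimpleNamespace(kind="bias", scope=Scope.OUTPUT_TILE),
    SimpleNamespace(kind="relu"),             # no scope: dropped from the plan
]
baseline = gemm_stage_kinds(2)
pinned = gemm_stage_kinds(2, a_pinned=True, epilogue_specs=ops)
print(len(baseline), baseline)
print(len(pinned), pinned)
```

The stage count changes deterministically with pinning and epilogue scope, which is the property sweep analysis relies on when it reads stage record counts.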
+### D5. Stage sequence of the Math plan: `M → N` order
+
+Stage sequence for each (m, n) tile:
+
+```
+DMA_READ → FETCH → MATH → STORE → DMA_WRITE
+```
+
+There is no K dimension, so concepts such as epilogue and accumulator residency do not
+apply. PE_FETCH_STORE's register-file accounting is handled the same way as in the GEMM
+plan.
+
+### D6. A plan is data: no SimPy dependency
+
+`PipelinePlan` is a dataclass defined in `pe_types.py` that holds
+`tiles: list[TilePlan]`. Each `TilePlan` holds `stages: tuple[Stage, ...]`. The plan
+itself is a near-immutable data structure (only `params: dict` of Stage is mutable) and
+holds no SimPy objects or events.
+
+At runtime, PE_SCHEDULER looks at the first stage of the plan, creates a `TileToken`,
+and feeds it into the pipeline; the TileToken carries `plan: TilePlan`,
+`stage_idx: int`, and `params: dict`. Self-routing proceeds by `TileToken.advance()`
+caching the `params` of the next stage (ADR-0014 D6).
+
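The data-only nature of a plan and the self-routing step of D6 can be sketched with plain dataclasses. The real definitions live in `pe_types.py` and are not reproduced here: the `kind` field on `Stage`, the boolean return value of `advance()`, and the example params are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    kind: str                                  # hypothetical field name
    params: dict = field(default_factory=dict)


@dataclass
class TilePlan:
    stages: tuple                              # tuple of Stage


@dataclass
class TileToken:
    plan: TilePlan
    stage_idx: int = 0
    params: dict = field(default_factory=dict)

    def advance(self) -> bool:
        """Step to the next stage and cache its params; False when finished."""
        if self.stage_idx + 1 >= len(self.plan.stages):
            return False
        self.stage_idx += 1
        self.params = self.plan.stages[self.stage_idx].params
        return True


plan = TilePlan(stages=(
    Stage("DMA_READ", {"nbytes": 4096}),
    Stage("FETCH"),
    Stage("MATH", {"op_kind": "relu"}),
))
token = TileToken(plan=plan, params=plan.stages[0].params)
visited = [plan.stages[token.stage_idx].kind]
while token.advance():
    visited.append(plan.stages[token.stage_idx].kind)
print(visited, token.params)
```

Nothing here touches an event loop, so such a plan can be built and inspected in a unit test without any simulation environment.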
+### D7. Contract of the plan generator: pure, deterministic, idempotent
+
+Calling it twice with the same input returns the same PipelinePlan (deterministic down
+to the order of `TilePlan.stages`). This contract aligns with the "deterministic tile
+dispatch order" requirement of ADR-0014 D6.
+
+There are no side effects (SimPy events, file I/O, global state), so tests can call it
+without an environment object (some cases in `tests/test_pe_pipeline.py` do so).
+
+## Alternatives Considered
+
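+### A1. Make tiling a component (e.g., PE_PLANNER)
+
+Rejected. Plan generation is a decision algorithm that consumes no SimPy time. Making it
+a component would (a) drag in unnecessary infrastructure such as an inbox and resources,
+and (b) force PE_SCHEDULER to split "receive plan" and "feed tiles" into two steps,
+creating a meaningless hop.
+
+### A2. Move plan generation into PE_SCHEDULER as a class method
+
+Rejected (for now). The module split provides (1) testability and (2) extensibility:
+introducing another plan algorithm (e.g., a DTensor-aware plan) only requires defining
+an additional function. If plan kinds multiply and explicit dispatch becomes necessary,
+a plan factory in PE_SCHEDULER will be introduced in a separate ADR.
+
+### A3. Force plans to be immutable (frozen dataclass + tuple)
+
+Partially adopted. `Stage` and `TilePlan` are dataclasses but not frozen. Reason:
+`Stage.params: dict` is filled at plan-generation time and only read at runtime (the
+TileToken merely caches it on advance). Fully freezing would require a dict → frozendict
+migration whose cost outweighs the benefit. The convention remains: do not mutate a
+plan outside the plan-generation step.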
+
+## Consequences
+
+- That `tiling.py` is a plan-generator module rather than a component is stated at the
+  ADR level, which heads off future analyses like G4 claiming "this component has no ADR".
+- The GEMM plan's stage sequence (D2) and the pinning/epilogue branches (D3, D4) are
+  fixed in an ADR, which clarifies the basis for interpreting stage record counts in
+  sweep analysis (`scripts/gemm_sweep.py`).
+- Thanks to the pure contract (D7), tests can verify plans without an environment, in
+  line with the spirit of ADR-0013 (verification strategy): "behavior validated by tests
+  with meaningful input cases".
+- When new plan kinds such as a DTensor-aware plan or a K-major plan are added, this ADR
+  serves as the baseline: add a new function and follow D1, D6, D7.
@@ -0,0 +1,131 @@
+# ADR-0043: Allreduce evaluation harness (`tests/sccl/`)
+
+## Status
+
+Accepted
+
+Documents the `tests/sccl/` evaluation harness; verified against the implementation
+(constants, file set, and sweep dimensions cross-checked).
+
+**Amended by ADR-0054**: the driver core, sweep, and renderers moved to the
+`milestone-1h-ccl` bench (single home); `tests/sccl/_allreduce_helpers.py` now
+re-exports from there (the pytest-only param builders and the `_run_distributed`
+wrapper stay local). The figure tests are unchanged.
+
+## Context
+
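+ADR-0032 defines the intercube all-reduce *algorithm*, and ADR-0023/0024/0027 define
+the IPCQ backend, the rank=SIP launcher, and `mp.spawn`. None of them, however,
+describes **how allreduce is driven and characterized**: correctness tests,
+latency/buffer-kind sweeps, and derived plots. Where ADR-0013 (verification strategy)
+is the general policy, this ADR fixes the concrete allreduce harness so that the
+"evaluation" half of the work is documented alongside the implementation.
+
+The harness lives in `tests/sccl/` (the package created when the allreduce tests were
+consolidated). It replaces the earlier flat layout of
+`tests/test_allreduce_multidevice.py` + `tests/test_distributed_*`.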
+
+## Decision
+
+### D1. Drive evaluation through the public `torch.distributed` path
+
+Correctness tests and sweeps run the collective through the real DDP-shaped path,
+`init_process_group(backend="ahbm") → mp.spawn → dist.all_reduce`
+(ADR-0024/0027), and do not use the lower-level `ctx.launch`.
+The shared helper
+`_run_distributed(tmp_path, monkeypatch, topo_path, corr_id, n_elem)` in
+`tests/sccl/_allreduce_helpers.py` builds the engine, runs the workers, and returns
+`(engine, n_cubes)`. `monkeypatch.chdir` points the backend's `load_ccl_config()`
+(a cwd lookup) at a per-case temporary `ccl.yaml`.
+
+The direct-launch reference (`run_allreduce`) stays in the same helper module, not for
+the distributed tests but because the IPCQ buffer-kind / root-center micro tests under
+`tests/` import it.
+
+### D2. One file per evaluation concern
+
+| File | Concern | `torch.distributed`? |
+|---|---|---|
+| `test_allreduce_ring_torus_mesh.py` | ring_1d / torus_2d (2×3) / mesh_2d_no_wrap (2×3) correctness | yes |
+| `test_distributed_default_topology.py` | full path with `topology.yaml` as-is | yes |
+| `test_plot_latency_sweep.py` | latency sweep rows (n_elem × topology) | yes |
+| `test_plot_buffer_kind_sweep.py` | TCM/SRAM/HBM sweep rows | yes |
+| `test_plot_topology_diagram.py` | topology.png (pure matplotlib) | no |
+| `test_plot_comparison_fsim.py` | broken-axis model vs FSIM comparison | no |
+| `test_intercube_root_center.py` | ADR-0032 center-root latency guard (direct path) | no |
+
+`_allreduce_helpers.py` holds the shared plumbing (driver, config writer,
+sweep/buffer-kind constants, plot aggregators, topology-diagram and FSIM-comparison
+emitters). It is not collected (no `test_` prefix).
+
+### D3. Latency metric: critical-path `pe_exec_ns`
+
+The latency reported per config is `crit_ns = max(pe_exec_ns)` over `engine._results`,
+that is, the PE execution time of the slowest rank.
+This is the value drawn on every latency chart and recorded in `summary.csv`.
+
+### D4. Sweep dimensions
+
+- **Latency sweep**: `n_elem ∈ {8, 32, 64, 128, 512, 1024, 2048, 4096,
+  8192, 16384, 32768, 49152}` (16 is excluded because it collides with `n_cubes`) ×
+  topology ∈ {ring_1d (6), torus_2d 2×3 (6), mesh_2d_no_wrap 2×3 (6)}.
+- **Buffer-kind sweep**: `buffer_kind ∈ {tcm, sram, hbm}` × a smaller `n_elem` grid,
+  on torus_2d 6-SIP (3×2). buffer_kind is set in the temporary `ccl.yaml`, which the
+  backend reads at `init_process_group` time (ADR-0023 D6).
+
+The 2×3 / 3×2 grids exercise the explicit `w/h` SIP resolution (ADR-0024 D5).
+
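The latency sweep grid of D4 and the metric of D3 are easy to enumerate in plain Python. This is a sketch, not the harness code: the constant and function names are hypothetical, the case dict layout is an assumption, and the per-rank timings in the usage lines are made-up numbers rather than simulator output.

```python
from itertools import product

# Values taken from D4; 16 is intentionally absent.
N_ELEMS = (8, 32, 64, 128, 512, 1024, 2048, 4096, 8192, 16384, 32768, 49152)
TOPOLOGIES = ("ring_1d", "torus_2d", "mesh_2d_no_wrap")


def latency_sweep_cases():
    """Cartesian product of topology and n_elem, one dict per sim run."""
    return [{"topology": t, "n_elem": n} for t, n in product(TOPOLOGIES, N_ELEMS)]


def crit_ns(pe_exec_ns_per_rank):
    """Critical-path latency: the slowest rank's PE execution time (D3)."""
    return max(pe_exec_ns_per_rank)


cases = latency_sweep_cases()
print(len(cases), cases[0], cases[-1])
print(crit_ns([1200.0, 1350.5, 1290.0]))   # illustrative per-rank timings
```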
+### D5. Derived plots via a `pytest_sessionfinish` aggregator
+
+The sweep tests are xdist-friendly: each parametrized case writes one JSON row to a
+staging directory. The conftest `pytest_sessionfinish` hook (controller node only)
+calls the aggregators in `_allreduce_helpers.py`:
+
+- `_aggregate_sweep_plots()` → per-topology PNGs + `summary.csv`
+- `aggregate_buffer_kind_plot()` → TCM/SRAM/HBM comparison PNG + csv
+
+The topology-diagram and FSIM-comparison figures are emitted directly by their own
+`test_plot_*` tests (no row staging: they are pure functions of `topology.yaml` and
+`summary.csv`, respectively). All outputs land in
+`docs/diagrams/allreduce_latency_plots/` and, per CLAUDE.md, are **derived artifacts**
+(consistent with the ADRs, no Phase-2 gate).
+
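+### D6. The FSIM comparison reference is a hard-coded constant
+
+`emit_comparison_fsim_plot()` overlays the model curve on a single external FSIM
+single-device reference (`366 µs`), which is held as a literal; there is no external
+data file. The "measured" series comes from the simulator (`op_log` GEMM counts,
+`composite_window_ns`), and the "theoretical" series from a hand-derived analytical
+model (the same model that ADR-0044 D5 marks as not yet verified against the ADRs).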
+
+## Consequences
+
+### Positive
+
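+- allreduce is evaluated through the same API as a real DDP script, so the harness
+  doubles as an integration test for ADR-0024/0027.
+- Figures are regenerated from committed data on every `pytest` run; there is no manual
+  plotting step.
+- The rectangular-grid sweeps provided the regression coverage that exposed the
+  ADR-0024 D5 `w/h` fix.
+
+### Negative / limitations
+
+- The full latency sweep runs under the default `pytest` (on the order of minutes); it
+  is not marked `slow`. (In contrast, ADR-0044 marks the GEMM sweep as `slow`.)
+- `test_intercube_root_center.py` holds a latency *threshold* assertion (the ADR-0032
+  center-root guard). It is the only absolute-latency assertion in the suite and is
+  sensitive to changes in the latency model (ADR-0033).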
+
+## Dependencies
+
+- **ADR-0013**: verification strategy (the general policy this ADR specializes).
+- **ADR-0023 / ADR-0024 / ADR-0027**: IPCQ backend, rank=SIP launcher, and `mp.spawn`
+  (the path that D1 drives).
+- **ADR-0032**: the algorithm under evaluation; the D4 grid exercises its topology
+  branches.
+- **ADR-0044**: the sibling GEMM evaluation harness.
+
+## Open questions
+
+- Should the latency sweep be marked `slow` for consistency with the GEMM sweep?
+- Should the FSIM reference move from a hard-coded constant to a version-controlled
+  data file?
@@ -0,0 +1,133 @@
+# ADR-0044: GEMM evaluation harness (`scripts/gemm_sweep.py` + `tests/gemm/`)
+
+## Status
+
+Accepted
+
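+Documents the GEMM evaluation/characterization harness; verified against the
+implementation (constants, tile sizes, figure set, and the script↔test split
+cross-checked). The D5/D6 caveats are recorded limitations, not inaccuracies.
+
+**Amended by ADR-0054**: the sweep and renderers moved to the `milestone-1h-gemm` bench
+(single home); `scripts/gemm_sweep.py` and `tests/gemm/` now re-export from there. The
+"data generation is a manual script / heavy work is opt-in" rule of D1/D2 is replaced
+by the evaluation-bench pattern (one bench regenerates everything; `MILESTONE_FAST=1`
+reuses the committed JSON).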
+
+## Context
+
+ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generator) define the GEMM
+*implementation*, and ADR-0033 defines the latency model. None of them, however,
+describes **how GEMM performance is swept and characterized**: the shape/variant sweep
+that produces timing data and the figures that interpret it. This ADR fixes that
+harness.
+
+Unlike the allreduce harness (ADR-0043), the GEMM sweep is **heavy** (24 sim runs:
+8 shapes × 3 operand-staging variants; the `512` shape alone has 2048 tiles). That
+weight determines the split below.
+
+## Decision
+
+### D1. Two-tier split: heavy data generation (script) vs. fast figures (test)
+
+- **Data generation stays a manual script**: `scripts/gemm_sweep.py` runs
+  `matmul-composite` (the ADR-0042 plan) across shape × variant through the same
+  `run_bench` path as the CLI, harvests `result.engine.op_log`, and writes
+  `docs/diagrams/gemm_sweep.json` (per-stage/per-engine wall-clock + occupancy
+  + record count + pe/composite window).
+- **Figure rendering is test-generated**: `tests/gemm/` reads the committed
+  `gemm_sweep.json` and renders matplotlib PNGs into `docs/diagrams/gemm_plots/`.
+  These tests are fast and run by default.
+
+Rationale: a slide-deck-scale sim sweep does not belong in every `pytest` run, but the
+figures (cheap, deterministic) should be freely regenerated and guarded by CI. This
+mirrors the script-vs-test split in CLAUDE.md (heavy/manual generation in scripts; fast
+assertions in tests).
+
+### D2. A slow regenerator test wraps the script
+
+`tests/gemm/test_gemm_sweep.py` is marked `@pytest.mark.slow` (excluded by the default
+`addopts: -m "not slow"`). It invokes `scripts/gemm_sweep.py` as a subprocess to
+regenerate `gemm_sweep.json` on demand
+(`pytest -m slow tests/gemm/test_gemm_sweep.py`). The sweep logic has a single home
+(the script) and the test only wraps it, so the sim-driving code is not duplicated.
+
+### D3. Figure set (3 charts, `load_ref` variant)
+
+| Test | PNG | Content |
+|---|---|---|
+| `test_plot_gemm_stage_breakdown.py` | `gemm_stage_breakdown.png` | per-stage engine wall-clock (DMA in / Fetch / GEMM / DMA out) |
+| `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_measured.png` | GEMM util % + useful eff % |
+| `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_theoretical_vs_measured.png` | theoretical vs simulator-measured util/eff |
+
+`tests/gemm/_gemm_plot_helpers.py` holds the shared renderers (the series logic mirrors
+the GEMM `_render_*` functions of `scripts/build_overview_slides.py`, which still draws
+natively into the PPTX). It is not collected (no `test_` prefix). Each `test_plot_*`
+skips if `gemm_sweep.json` is missing.
+
+### D4. Tile sizes are data-driven; under-tile shapes are flagged
+
+Tile sizes are read from `gemm_sweep.json` (`tile_sizes`), which the sweep records from
+`PeSchedulerComponent.TILE_M/K/N = 32/64/32`, the authoritative source. Shapes with
+`M<TILE_M ∨ K<TILE_K ∨ N<TILE_N` are flagged on the charts ("under-tile"). The `512³`
+shape is excluded from the figures (`EXCLUDED_SHAPES`).
+
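The under-tile condition in D4 is a simple predicate on a shape. The sketch below assumes shapes are `(M, K, N)` tuples and uses a hypothetical helper name `is_under_tile`; the tile sizes are the 32/64/32 values quoted above, passed in explicitly the way a renderer might after reading `tile_sizes` from the JSON.

```python
def is_under_tile(shape, tile_sizes=(32, 64, 32)):
    """True when any dimension is smaller than its tile size (M, K, N order)."""
    (m, k, n), (tile_m, tile_k, tile_n) = shape, tile_sizes
    return m < tile_m or k < tile_k or n < tile_n


# Illustrative shapes, not the actual sweep list.
shapes = [(16, 64, 32), (32, 64, 32), (128, 32, 128), (256, 256, 256)]
flags = {shape: is_under_tile(shape) for shape in shapes}
print(flags)
```

A shape that exactly equals the tile sizes is not flagged, because the comparison is strict.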
+### D5. Theoretical model: inherited constants, not yet verified against the ADRs
+
+The "theoretical" curve uses an analytical ideal-pipeline model with constants copied
+verbatim from `scripts/build_overview_slides.py`:
+
+```
+HBM_GBS = 256.0   # GB/s        T_STAGE = 16.0 ns
+D_STAGES = 3                    BPE = 2
+```
+
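+**These values have not yet been sourced against the ADRs.** In particular, the `256`
+in ADR-0033 is `burst_bytes` (256 B), a *different* quantity from this `256 GB/s`, and
+ADR-0033 derives bandwidth as `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`.
+`T_STAGE` and the stage count have not been traced to ADR-0014 here either. The model
+is therefore **only consistent with the existing deck script, not verified against the
+ADRs**, and the constants are duplicated (deck + helper). Reconciling them (sourcing
+from topology/ADR-0033/0014 and removing the duplication) is deferred; see Open
+questions.
+
+### D6. Known naming caveat: the `_measured` chart
+
+`gemm_mac_utilization_measured.png` currently plots the *theoretical* ideal-pipeline
+numbers (the footnote says so). Only the file name says "measured". This is a known
+misnomer, pending a decision on whether to repoint its content to the
+simulator-measured series or to retitle it.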
+
+## Consequences
+
+### Positive
+
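+- GEMM figures are test-generated and CI-guarded, like allreduce.
+- The heavy sweep stays opt-in, so the default test run is fast.
+- The slow test reuses the single source of sweep logic (the script).
+
+### Negative / limitations
+
+- The theoretical model constants (D5) are unverified and duplicated.
+- The `_measured` figure is a misnomer (D6).
+- `build_overview_slides.py` still does not embed these PNGs and instead draws the GEMM
+  bars natively from `gemm_sweep.json`; rewiring the deck to consume the test artifacts
+  has not been done.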
+
+## Dependencies
+
+- **ADR-0013**: verification strategy.
+- **ADR-0014 / ADR-0042**: PE pipeline + tile-plan generator, the GEMM implementation
+  the sweep measures; the stage record count of D4 comes from ADR-0042 D2/D3.
+- **ADR-0033**: latency model, the source to which the D5 constants should be (but are
+  not yet) traced.
+- **ADR-0043**: the sibling allreduce evaluation harness.
+
+## Open questions
+
+- Reconcile the D5 constants against `topology.yaml` / ADR-0033 / ADR-0014 and remove
+  the duplication (a single source for model parameters)?
+- Resolve the D6 `_measured` naming (repoint the content vs. retitle)?
+- Rewire `build_overview_slides.py` to embed the `gemm_plots/` PNGs instead of drawing
+  bars natively?
@@ -0,0 +1,265 @@
+# ADR-0045: Bench Module Contract — registration, dispatch, and authoring
+
+## Status
+
+Accepted (2026-05-21).
+
+Defines in one place the registration mechanism (@bench) of the
+`src/kernbench/benches/` package, the CLI dispatch path (`kernbench run/list`), and the
+contract to follow when writing a new bench module. ADR-0010 (CLI surface) specifies
+the `kernbench list/run` interface, but **how a bench is registered and which function
+signature it must follow** was missing at the ADR level.
+
+**Extended by ADR-0054**: a third pattern is added to the single-configuration rule of
+D5. *Evaluation benches* (e.g., `milestone-1h-*`) drive multiple configurations, build
+their own engine per configuration, and submit a sentinel tensor to satisfy D4.
+
+## First action
+
+When the `kernbench.benches` package is imported, `__init__.py` immediately calls
+`_eager_import_and_audit(__path__, __name__)`. The first thing this function does is
+list every sibling module in the package directory with
+`pkgutil.iter_modules(__path__)` and then **eagerly load**, via
+`importlib.import_module(...)`, every module that matches neither of the following
+conditions:
+
+- its name is `registry` (the infrastructure itself)
+- its name starts with `_` (a helper module)
+
+At import time, the `@bench(name=..., description=...)` decorators in each module run,
+appending a `(name, description, fn)` tuple to the `_PENDING` list and adding
+`fn.__module__` to the `_REGISTERED_MODULES` set.
+
+Once all imports finish, `_audit_modules(imported, _REGISTERED_MODULES)` is called; if
+any **module was imported but never called @bench**,
+`RuntimeError("Bench module(s) missing @bench decorator: ...")` is raised immediately.
+When this audit passes, indices have not yet been assigned; on the first `list_all()` /
+`resolve(...)` call, `_finalize()` assigns 1-based indices in alphabetical order of
+name.
+
+In other words, **the bench infrastructure's first job is to import every non-helper
+module in the package directory and audit that each one called @bench at least once**.
+
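The filter and audit steps above can be sketched on plain name lists, without touching the real package. The function names `modules_to_import` and `audit_modules` are hypothetical stand-ins for the private helpers named in the text, and short module names are used here for brevity (the real code records `fn.__module__`, which is presumably a dotted path).

```python
def modules_to_import(sibling_names):
    """Skip the registry infrastructure and '_'-prefixed helper modules."""
    return [n for n in sibling_names if n != "registry" and not n.startswith("_")]


def audit_modules(imported, registered):
    """Raise if any imported module never registered a bench."""
    missing = sorted(set(imported) - set(registered))
    if missing:
        raise RuntimeError(
            "Bench module(s) missing @bench decorator: " + ", ".join(missing)
        )


siblings = ["registry", "_shared_helpers", "gemm_single_pe", "qkv_gemm"]
imported = modules_to_import(siblings)
print(imported)

audit_modules(imported, registered={"gemm_single_pe", "qkv_gemm"})   # passes

try:
    audit_modules(imported, registered={"gemm_single_pe"})
except RuntimeError as exc:
    audit_error = str(exc)
    print(audit_error)
```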
+## Context
+
+`src/kernbench/benches/` currently holds 8 bench modules (`ccl_allreduce`,
+`gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`, `qkv_gemm`,
+`qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows this unified flow:
+
+```
+kernbench run --topology <T> --bench <N>
+   ↓
+cli/main.py::cmd_run
+   ↓  resolve_topology(T)  + resolve(N)  + resolve_device(device_arg)
+   ↓
+runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
+   ↓  engine_factory(topology, device) → GraphEngine
+   ↓  RuntimeContext(engine, target_device, correlation_id, spec)
+   ↓
+bench_fn(ctx)        ← the bench-defined run(torch) is called
+   ↓  submits via ctx.empty/zeros/from_numpy/launch/distributed.* etc.
+   ↓
+ctx.wait_all()       ← drains any handles that are still incomplete
+   ↓
+BenchResult(completion, correlation_id, trace, traces, engine)
+```
+
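+ADR-0010 covers only the CLI surface (`run/list/probe/web`), and ADR-0007 covers only
+the responsibility boundary between the runtime API and sim_engine. The actual question,
+"what shape must a new bench file take?", can only be traced through code conventions.
+As a result:
+
+- The calling rules of the @bench decorator (kebab-case name, non-empty description)
+  exist only in code.
+- The bench function signature (`def run(torch)`) is effectively a convention, enforced
+  only by the fact that CLI dispatch calls `spec.run`.
+- People adding a new bench learn that helper modules must be separated with a `_`
+  prefix only after hitting the audit RuntimeError.
+- It is unclear how far the single-device convention (CLAUDE.md Part 2 CLI Semantics)
+  applies from a bench author's point of view (are CCL multi-SIP benches an exception?).
+
+This ADR settles these ambiguities in one place.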
+
+## Decision
+
+### D1. The @bench decorator contract
+
+```python
+from kernbench.benches.registry import bench
+
+@bench(name="my-bench", description="Short, complete-sentence description.")
+def run(torch):
+    ...
+```
+
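+- `name`: a kebab-case string. It must match the regex `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`:
+  lowercase letters, digits, and dashes only; no underscores (`_`); must start with a
+  letter.
+- `description`: a non-empty string (length > 0 after strip). It is shown verbatim in
+  the CLI `list` output.
+- The decorator **returns fn unchanged**, so the function can also be called directly.
+  Its only side effect is adding the registration to `_PENDING`.
+
+Violating either of the first two rules raises `ValueError` immediately. A duplicate
+name is caught at `_finalize()` time as `RuntimeError("duplicate bench name: ...")`.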
+
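+### D2. Module file conventions
+
+`src/kernbench/benches/<slug>.py` must be one of the following:
+
+- **bench module**: on its top-level import path, `@bench(...)` runs at least once and
+  registers at least one bench.
+- **helper module**: the file name starts with `_` (e.g., `_shared_helpers.py`). It is
+  skipped during the `iter_modules` walk.
+
+The audit (`_audit_modules`) does not allow a non-helper module that never calls
+@bench. The intended result: adding a new file to `benches/` is enough to get it
+registered automatically, and the distinction from helpers is marked clearly by a
+single **file-name prefix**.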
+
+### D3. The bench function signature is `def run(torch)`
+
+The decorator does not enforce the function name, but **CLI dispatch calls
+`spec_entry.run` (that is, the decorated callable)**. The convention is therefore:
+
+- Function name: `run`. Decorating a function with another name still works, but always
+  use `run` for readability and grep-ability.
+- Argument: a single positional argument `torch`. It is actually a `RuntimeContext`
+  instance that exposes a PyTorch-style namespace
+  (zeros/empty/launch/distributed/...) (ADR-0024 D3).
+- Return value: arbitrary (`Any`). `run_bench` currently ignores the return value and
+  tracks completion via `ctx.handles()` / `engine.get_completion()`.
+
+The name `torch` mimics the PyTorch-compatible idiom; no actual PyTorch module is
+passed in (aligned with the "rank = SIP" launcher convention of ADR-0024).
+
+### D4. A bench must perform at least one submit
+
+When `ctx.handles()` is empty, `run_bench` returns BenchResult.completion as
+`ok=False, error_code="NO_REQUESTS"`. A meaningful bench must therefore call at least
+one of the following:
+
+- Tensor creation API: `torch.zeros(...)`, `torch.empty(...)`; internally a
+  `MmuMapMsg` and (for zeros) a `MemoryWriteMsg` are submitted.
+- Kernel execution API: `torch.launch(name, fn, *args)`; submits a `KernelLaunchMsg`
+  per SIP.
+- (Exception) empty placeholder bench: a stub that only calls `print(...)`, like
+  `ipcq_allreduce.py`, receives the NO_REQUESTS result. CI is assumed to recognize it
+  as a placeholder and handle it separately.
+
+### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)
+
+The **"benchmarks MUST remain single-device"** convention stated in CLAUDE.md Part 2
+CLI Semantics is interpreted as follows:
+
+- **Regular bench (single SIP)**: define tensor placement with `dp = DPPolicy(...)` and
+  launch kernels with `torch.launch(...)`. The SIP index is determined by `--device`
+  (the CLI's responsibility).
+- **CCL bench (multi-SIP)**: as an exception, follows the rank = SIP pattern
+  (ADR-0024 D3) using `torch.distributed.init_process_group(backend="ahbm")` +
+  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)`. `--device` is ignored (or
+  assumed to be `all`), and each spawned worker binds its own SIP with
+  `torch.ahbm.set_device(rank)`.
+
+Any multi-device call outside these two patterns (e.g., one bench function launching
+directly on several SIPs in the same process) is prohibited by this ADR. Even when the
+CLI is invoked with `--device all`, the bench runs only once; to handle multiple SIPs
+within it, use the second pattern of D5.
+
+### D6. Name and index resolution (`resolve`)
+
+`resolve(identifier: str)` returns a BenchSpec in the following order:
+
+1. `identifier.isdigit()` → convert to an integer and return the spec whose `index`
+   equals it among the `_REGISTRY` entries. If none, `ValueError("No bench with index ...")`.
+2. `identifier in _REGISTRY` → direct lookup.
+3. Otherwise → `ValueError("Unknown bench ...")`.
+
+An empty/whitespace identifier raises
+`ValueError("bench identifier must be a non-empty string.")`.
+
+The CLI passes the `--bench` argument to `resolve` as-is. Users can therefore use
+either `kernbench run --bench gemm-single-pe` or `kernbench run --bench 2`.
+
+### D7. Indices are not a stable API
+
+`_finalize()` sorts `_PENDING` **alphabetically by name** and then assigns 1-based
+indices, so adding a new bench can shift the indices of existing benches. Therefore:
+
+- Human-friendly interactive use: indices are fine.
+- Scripts / CI automation: always use names.
+
+This is stated in the `registry.py` module docstring.
+
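The interaction between D6 and D7 is easiest to see with a toy registry. The sketch below maps names straight to indices instead of to BenchSpec objects, the functions `finalize` and `resolve` are simplified stand-ins for the private helpers, and every bench name except `gemm-single-pe` and `matmul-composite` is a hypothetical kebab-case form.

```python
def finalize(pending_names):
    """Sort names alphabetically and assign 1-based indices."""
    return {name: idx for idx, name in enumerate(sorted(pending_names), start=1)}


def resolve(identifier, registry):
    """Resolve an index string or a name to a registered bench name."""
    if not isinstance(identifier, str) or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    if identifier.isdigit():
        wanted = int(identifier)
        for name, index in registry.items():
            if index == wanted:
                return name
        raise ValueError(f"No bench with index {wanted}")
    if identifier in registry:
        return identifier
    raise ValueError(f"Unknown bench {identifier!r}")


before = finalize(["qkv-gemm", "gemm-single-pe", "matmul-composite"])
after = finalize(list(before) + ["ccl-allreduce"])   # a newly added bench

print(resolve("2", before), resolve("2", after))      # same index, different bench
print(resolve("gemm-single-pe", before), resolve("gemm-single-pe", after))
```

The same index resolves to a different bench once a name that sorts earlier is added, while resolution by name is unaffected.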
+### D8. The surface RuntimeContext exposes to a bench
+
+The surface a bench function may legitimately use through the `torch` parameter:
+
+- **Tensor creation**: `torch.empty(shape, dtype=..., dp=DPPolicy(...), name=...)`,
+  `torch.zeros(...)`, `torch.from_numpy(arr)`. All of them submit host-side metadata +
+  device deployment (MmuMap + MemoryWrite).
+- **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)` converts
+  `(Tensor, int, float)` positional arguments into `TensorArg` / `ScalarArg`, issues a
+  `KernelLaunchMsg` per SIP, then drains.
+- **Synchronization**: `torch.wait(handle)`, `torch.wait_all()` (called automatically
+  by run_bench).
+- **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`,
+  `torch.distributed.get_world_size()`, `torch.distributed.all_reduce(t, op=...)`
+  (ADR-0024/0027).
+- **Multi-process (rank=SIP)**: `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)`
+  (ADR-0024 D3 / ADR-0027).
+- **Device binding**: `torch.ahbm.set_device(rank)` or
+  `torch.accelerator.set_device_index(rank)` (both point to the same namespace).
+- **IPCQ installation**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)` (ADR-0023 D10).
+- **Spec lookup**: `torch.spec`, a dict built by the topology builder (system,
+  cube_mesh, HBM parameters, etc.). It lets benches be written without depending
+  directly on `topology.yaml` parameters.
+
+A bench must not directly access private members of RuntimeContext that are not listed
+above (`_handles`, `_traces`, `_allocators`, etc.). This aligns with the layer-boundary
+spirit of ADR-0007: only the one-way direction bench → runtime API → sim_engine is
+allowed.
+
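+### D9. Parameterization through environment variables is allowed
+
+Injecting bench parameters from outside via `os.environ.get("MATMUL_M", ...)`,
+as `matmul_composite.py` does, is allowed. Reasons:
+
+- D3 fixes the bench signature to `def run(torch)`, so parameters are hard to
+  pass as positional/keyword arguments.
+- The environment-variable pattern is a natural hook for run-time sweeps such
+  as `MATMUL_VARIANT`.
+- External drivers such as `scripts/gemm_sweep.py` (ADR-0044) use this hook.
+
+However, if environment variables change a bench's behavior, the module
+docstring must list every variable (matmul_composite.py is the example).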
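
A minimal sketch of the D9 pattern follows. `MATMUL_M` and `MATMUL_VARIANT` are the variable names mentioned above; the helper name `read_matmul_params` and the default values are assumptions made for illustration, not what `matmul_composite.py` actually uses.

```python
import os


def read_matmul_params(env=os.environ) -> dict:
    """Read sweep parameters from the environment.

    Environment variables (all optional):
      MATMUL_M        problem size M (default below is an assumed value)
      MATMUL_VARIANT  kernel variant name (default below is an assumed value)
    """
    return {
        "m": int(env.get("MATMUL_M", "128")),
        "variant": env.get("MATMUL_VARIANT", "baseline"),
    }
```

Taking `env` as a parameter keeps the helper testable without mutating the real process environment; a bench body would simply call `read_matmul_params()`.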
+
+## Alternatives Considered
+
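+### A1. Keep the bench list in an explicit manifest file (YAML)
+
+Rejected. The @bench decorator + audit pattern guarantees "adding a file =
+automatic registration", which concentrates the cognitive cost for a new bench
+author in one place (writing the file). A separate manifest carries a high
+drift risk for maintainers, and helper separation is already clear from the
+`_` prefix.
+
+### A2. Take the bench function name as a decorator argument (`@bench(name=..., entry="run_xxx")`)
+
+Rejected. It breaks the simplicity of the dispatcher calling only `spec.run`.
+The single `run` convention suffices; if variants are needed, register several
+functions in the same module (each decorated with @bench).
+
+### A3. A separate `@multi_device_bench` decorator for CCL benches
+
+Rejected. The two patterns from D5 (single + ADR-0024 multi-SIP) already
+express all 8 current benches. A separate decorator forces branching in the
+dispatcher and adds complexity, and multi-SIP intent is visible enough from
+the `init_process_group(...)` call in the bench body.
+
+### A4. Make indices a stable API (registration order / explicit index= argument)
+
+Rejected. This is the trade-off stated in D7: user friendliness comes first
+(alphabetically sorted indices read naturally as 1, 2, 3... in list output).
+Scripts can simply use names.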
+
+## Consequences
+
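+- "How to add a bench" is consolidated in one ADR, so a new author only needs
+  to follow D1-D3 and D8 without grepping code.
+- Separating helper modules with a `_` prefix is justified at ADR level, which
+  frees future shared helpers of the form `benches/_*.py`.
+- D5 makes explicit that the single-device convention in CLAUDE.md Part 2 CLI
+  Semantics does not conflict with multi-SIP CCL benches: the two are
+  orthogonal.
+- The rationale (D9) for `scripts/gemm_sweep.py` in ADR-0044 (GEMM eval
+  harness) using the environment-variable hook is fixed in this ADR.
+- Because index instability (D7) is now explicit, CI code such as
+  `kernbench run --bench 3` should be audited right after this ADR is accepted.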
@@ -0,0 +1,307 @@
+# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies the set of `tl.*` primitives exposed by `TLContext` in
+`src/kernbench/triton_emu/`, their semantics, and the contract of the two
+execution modes (command-list / greenlet runner). ADR-0014/0020 define the PE
+pipeline and the 2-pass execution model, but **the `tl.*` surface that a
+bench's kernel function calls** had not been documented at ADR level.
+
+## First action
+
+When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id, num_cubes,
+scratch_base, scratch_size)` is constructed, it first initializes the
+following six groups of fields:
+
+- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes`:
+  the values that `tl.program_id` / `tl.num_programs` return.
+- `self._dispatch_cycles`: the cycle count of the `PeCpuOverheadCmd(cycles)`
+  issued automatically at the start of a `tl.*` API call.
+- `self._runner`: a `KernelRunner` instance (greenlet mode if present,
+  command-list mode if absent).
+- `self._commands: list[PeCommand] = []`: the command sequence accumulated in
+  command-list mode.
+- `self._handle_counter = 0`, `self._completion_counter = 0`: used to generate
+  new TensorHandle / CompletionHandle ids.
+- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0`:
+  the PE-local scratch region (for allocating output handle addresses of
+  math/dot/composite).
+
+In other words, **TLContext's first job is to fill in the metadata: where this
+kernel instance runs (sip/cube/pe), at what scale (num_programs/num_cubes),
+and in which mode (runner or not) it issues commands**. No SimPy event exists
+at this point and no command is issued.
+
+The first runtime action happens when the kernel function first calls
+`tl.<api>()`. The standard entry behavior of a command-emitting `tl.*` API is:
+
+1. Call `self._emit_dispatch_overhead()`: if `dispatch_cycles > 0`,
+   immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
+2. API-specific work (create the TensorHandle, build the command).
+3. `self._emit(cmd)`: in runner mode, pass cmd to the SimPy side through
+   greenlet.switch; otherwise append it to `self._commands`.
+
+## Context
+
+The `tl.*` surface consists of the methods exposed by `TLContext`, and the `tl`
+parameter a kernel function receives is that object. The contract as seen by
+the user (bench author):
+
+- which primitives exist
+- what data flow each primitive produces (DMA / compute / IPCQ /
+  metadata-only)
+- how a TensorHandle's `space` and `addr` are determined
+- how command-list mode and greenlet mode differ
+
+ADR-0014 (PE pipeline) defines the PeCommands that PE_SCHEDULER receives, but
+how `tl.*` emits them exists only as code convention. Likewise ADR-0020
+(2-pass data execution) mentions greenlet mode in D3, but the signature
+difference between the runner and non-runner paths (return-value handling) is
+not stated at ADR level. This ADR fills that gap.
+
+## Decision
+
+### D1. The `tl` parameter is a `TLContext` instance
+
+A bench's kernel function follows this signature:
+
+```python
+def _kernel(arg1, arg2, ..., tl, **kwargs):
+    ...
+```
+
+`tl` is an instance of `kernbench.triton_emu.tl_context.TLContext`. The name
+mimics real Triton's `triton.language` module; the real Triton module is not
+what gets passed in.
+
+A kernel function is a plain Python function with no `yield` / `async`. A
+`tl.*` call triggers SimPy events, but to the caller (the kernel) it looks
+synchronous, because in greenlet mode KernelRunner relays between SimPy and
+the kernel (ADR-0020 D3).
+
+### D2. Two execution modes: command-list / greenlet runner
+
+- **command-list mode (`runner is None`)**: `tl.*` calls accumulate PeCommands
+  in the `self._commands` list. DMA / GEMM / Math consume no real SimPy time,
+  and the return value is a metadata-only TensorHandle (data=None).
+  PE_SCHEDULER / sim_engine later replay the command sequence in time.
+
+- **greenlet runner mode (`runner is not None`)**: `tl.*` calls go through
+  `self._emit(cmd)` to `runner.switch_to_simpy(cmd)`, handing control to the
+  parent greenlet (SimPy). The parent dispatches cmd to components, consumes
+  SimPy time, and for a DMA read returns the actual numpy data. The kernel
+  receives that result and proceeds to the next line (the data-aware
+  execution model of ADR-0020 D3).
+
+The mode is chosen by whether a KernelRunner instance is injected into
+TLContext; the `tl.*` methods are unaware of the difference and behave
+uniformly through the `_emit()` helper.
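
The mode split described in D2 can be sketched as follows. `MiniTL` and `FakeRunner` are toy stand-ins invented for illustration; only the `_emit` / `switch_to_simpy` / `_commands` names come from the text, and the real runner switches greenlets rather than answering inline.

```python
class MiniTL:
    """Toy stand-in for TLContext showing only the D2 mode split."""

    def __init__(self, runner=None):
        self._runner = runner
        self._commands = []

    def _emit(self, cmd):
        if self._runner is not None:
            # greenlet runner mode: hand the command over and get a result back
            return self._runner.switch_to_simpy(cmd)
        # command-list mode: only record the command for later replay
        self._commands.append(cmd)
        return None


class FakeRunner:
    """Hypothetical runner that answers immediately (no greenlet switch)."""

    def switch_to_simpy(self, cmd):
        return f"result-of-{cmd}"


recorded = MiniTL()
recorded._emit("DmaReadCmd")
live_result = MiniTL(FakeRunner())._emit("DmaReadCmd")
```

Because every primitive funnels through one `_emit`, a new primitive written against it works in both modes without branching on the runner itself.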
+
+### D3. Primitive 카테고리
+
+#### D3.1. Reference (no DMA, metadata only)
+
+- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: creates only a handle that
+  references HBM data and issues no DMA. Used when the composite scheduler
+  streams per tile (e.g. the b operand of GEMM).
+
+#### D3.2. Data movement (blocking, DMA engine)
+
+- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → result handle.
+  Issues `DmaReadCmd`. In greenlet mode the real numpy array is attached to
+  the result handle's `.data`; in command-list mode it is a placeholder. The
+  returned handle has `space="hbm"`, `pinned=True`.
+- `tl.store(ptr, handle) -> None`: TCM → HBM. Issues `DmaWriteCmd`. In greenlet
+  mode, `_store.write("hbm", ptr, data)` is called first, and only when
+  `handle.data` is present (visibility = issue time, ADR-0020 D3).
+
+#### D3.3. GEMM / compute (blocking)
+
+- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must be in TCM, with
+  shapes (M,K) × (K,N) → (M,N). Issues `GemmCmd`; the output handle is
+  allocated in PE-local scratch via `_make_compute_out(shape, dtype)`.
+- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
+  acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking
+  tiled pipeline. Issues `CompositeCmd`. `epilogue` is a list of dicts, each
+  with an `"op"` key + op-specific fields + an optional `"scope"` (k_tile /
+  output_tile); an unknown op or a missing field raises ValueError
+  immediately. Synchronize on the returned CompletionHandle with `tl.wait(h)`.
+
+#### D3.4. Math: unary (blocking)
+
+- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
+  `tl.cos(x)`, `tl.sin(x)`: each issues `MathCmd(op=<name>, inputs=(x,), out=)`.
+  `out` is a scratch allocation with the same shape/dtype.
+
+#### D3.5. Math: binary (blocking)
+
+- `tl.maximum(a, b)`, `tl.minimum(a, b)`: via `_binary_math`.
+- `tl.fma(a, b, c)`: `a*b + c`. Three inputs.
+- `tl.clamp(x, min, max)`: `MathCmd(op="clamp", inputs=(x, min, max))`.
+- `tl.where(cond, a, b)`: `MathCmd(op="where", inputs=(cond, a, b))`.
+- `tl.softmax(x, axis=-1)`: a single MathCmd(op="softmax"), so timing is
+  accounted once. The Phase 2 DataExecutor expands it into the canonical
+  sequence (x-max → exp → sum → div).
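
The canonical softmax sequence named above (x-max → exp → sum → div) can be written out as a small pure-Python sketch over a 1-D list. This is only the textbook sequence; how the Phase 2 DataExecutor actually implements it (presumably on numpy arrays and along an axis) is not shown in this document.

```python
import math


def softmax_canonical(xs):
    """Softmax over a 1-D list using the x-max -> exp -> sum -> div sequence."""
    peak = max(xs)
    shifted = [x - peak for x in xs]       # x - max, for numerical stability
    exps = [math.exp(v) for v in shifted]  # exp
    total = sum(exps)                      # sum
    return [e / total for e in exps]       # div


probs = softmax_canonical([1.0, 2.0, 3.0])
```

Subtracting the maximum first keeps every exponent non-positive, so large inputs do not overflow.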
+
+#### D3.6. Reduction (blocking)
+
+- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)`: return an output
+  handle with the size of the given axis reduced to 1. Each issues
+  `MathCmd(op=<name>, inputs=(x,), out=, axis=axis)`.
+
+#### D3.7. Index / scalar (PE_CPU, no engine)
+
+- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE index),
+  `axis==1` → cube_id (ADR-0022).
+- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per cube),
+  `axis==1` → num_cubes.
+- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range in TCM.
+  Metadata only, no command issued.
+- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape, value,
+  dtype="f16") -> TensorHandle`: placeholders in TCM. No command issued.
+
+#### D3.8. Scalar helpers (no command, no engine)
+
+- `TLContext.cdiv(a, b) -> int` (static): ceiling division
+  `-(-a // b)`. Mimics real Triton's `tl.cdiv`.
+
+#### D3.9. Metadata-only (no compute, no DMA)
+
+- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims of the
+  shape swapped. It shares the same addr/data; no command is issued.
+
+#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
+
+- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
+  dtype="f16", space="tcm") -> None`: blocking send. Accepts either the handle
+  form or the raw-address form. Issues `IpcqSendCmd`. When a snapshot of the
+  handle's `.data` is carried in the command, it avoids a race on the data
+  seen by the receiving side.
+- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
+  dst_space=None) -> TensorHandle`: blocking recv. If both `dst_addr` and
+  `dst_space` are given it runs in "copy_to_dst" mode, otherwise in
+  "return_slot" mode. In greenlet mode the real data is attached to the
+  handle's `.data`.
+- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
+  **DIAGNOSTIC ONLY**. The blocking recv synchronization still applies, but
+  the slot-read latency (slot-IO + PE↔bank fabric drain) is skipped. It exists
+  for an apples-to-apples comparison with `tl.store` in the pe2pe overview
+  plot. Production kernels must not use it; it is isolated as a separate
+  `consume=False` command branch, so an accidental flag cannot enable it.
+- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
+  recv. Returns a `RecvFuture`; collect the result later with `tl.wait(future)`.
+
+#### D3.11. Composite + control
+
+- `tl.composite(...)`: described in D3.3.
+- `tl.wait(handle=None)`: waits on a `CompletionHandle` (composite), a
+  `RecvFuture` (async recv), or `None` (all pending composites).
+- `tl.cycles(n)`: explicitly declares PE_CPU scalar execution overhead.
+  Issues `PeCpuOverheadCmd(cycles=n)`.
+
+### D4. TensorHandle arithmetic operators: thread-local TLContext
+
+When the `tl_context.py` module is loaded, `_enable_tensor_ops()` is called and
+patches `TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__`. Each
+operator calls `_binary_math` on the active TLContext stored in the
+thread-local `_ctx` (a module variable).
+
+So inside a kernel, `c = a + b` is equivalent to issuing
+`MathCmd(op="add", inputs=(a,b), out=)` and returning a new TensorHandle.
+
+Active TLContext management:
+
+- `TLContext._set_active(ctx)`: sets the active ctx for the current
+  thread/greenlet.
+- `TLContext._get_active()`: lookup (RuntimeError if none).
+- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets the active
+  ctx on entry, runs the kernel, and restores it to None on exit.
+
+On every cmd dispatch, `KernelRunner`'s `_switch_kernel` calls
+`_set_active(tl)` directly, so the active ctx is restored even if another PE
+runner in the same thread has overwritten it.
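
The active-context pattern in D4 can be sketched in isolation. Everything here (`Ctx`, `Handle`, `set_active`, `get_active`) is a simplified stand-in invented for illustration; the real code patches `TensorHandle` and routes through `TLContext._binary_math`.

```python
import threading

_local = threading.local()


def set_active(ctx):
    _local.ctx = ctx


def get_active():
    ctx = getattr(_local, "ctx", None)
    if ctx is None:
        raise RuntimeError("no active context")
    return ctx


class Handle:
    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        # The operator has no ctx argument, so it asks for the active one.
        return get_active().binary_math("add", self, other)


class Ctx:
    def __init__(self):
        self.commands = []

    def binary_math(self, op, a, b):
        out = Handle(f"({a.name} {op} {b.name})")
        self.commands.append((op, a.name, b.name, out.name))
        return out


def run_kernel(kernel_fn, ctx, *args):
    """Set the active ctx on entry and reset it to None on exit."""
    set_active(ctx)
    try:
        return kernel_fn(*args)
    finally:
        set_active(None)


ctx = Ctx()
result = run_kernel(lambda a, b: a + b, ctx, Handle("a"), Handle("b"))
```

The `try/finally` mirrors the "restore to None on exit" rule, so an operator used outside a running kernel fails loudly instead of recording into a stale context.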
+
+### D5. Scratch allocator: compute output handles
+
+Ops that produce a result, such as `tl.dot`, `tl.exp`, and `tl.add`
+(TensorHandle `__add__`), call `_make_compute_out(shape, dtype)` to allocate a
+16-byte aligned scratch address. The address is issued with `space="tcm"`,
+and `tl.send` / `tl.store` can later use the handle as a source.
+
+If `_scratch_base == 0` (e.g. command-list mode), the allocated address is
+returned as 0 and the handle cannot be used as a send/store source (in that
+case only handles obtained from `tl.load` can be a source).
+
+If the cursor exceeds `_scratch_size` (default 1 MiB), RuntimeError is raised.
+The cursor must be reset to 0 at the start of each kernel invocation; in the
+current code this happens naturally because KernelRunner creates a new
+TLContext every time.
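
A minimal bump-allocator sketch of the D5 rules follows. The class name and the exact overflow check are assumptions; only the 16-byte alignment, the zero-base behavior, the 1 MiB default, and the RuntimeError come from the text, and the base address is assumed to be 16-byte aligned itself.

```python
class ScratchAllocator:
    """Toy bump allocator: 16-byte aligned offsets inside a fixed region."""

    ALIGN = 16

    def __init__(self, base, size=1 << 20):  # default size: 1 MiB
        self._base = base
        self._size = size
        self._cursor = 0

    def alloc(self, nbytes):
        if self._base == 0:
            # No scratch region configured: address 0, unusable as a source.
            return 0
        start = (self._cursor + self.ALIGN - 1) // self.ALIGN * self.ALIGN
        if start + nbytes > self._size:
            raise RuntimeError("scratch overflow")
        self._cursor = start + nbytes
        return self._base + start


scratch = ScratchAllocator(base=0x1000)
first = scratch.alloc(10)   # occupies offsets 0..9
second = scratch.alloc(4)   # rounded up to the next 16-byte boundary
```

There is no `free`: the region is reclaimed wholesale by constructing a fresh allocator, which matches the "new TLContext per invocation" reset described above.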
+
+### D6. Dispatch overhead: `PeCpuOverheadCmd(dispatch_cycles)`
+
+`_emit_dispatch_overhead()` is called at the entry of every non-metadata
+`tl.*` call and issues `PeCpuOverheadCmd(dispatch_cycles)` when
+`dispatch_cycles > 0`. This models the cycle cost that PE_CPU spends on
+command dispatch itself.
+
+Defaults:
+
+- Default of the `dispatch_cycles` parameter of `TLContext.__init__`: 1 cycle.
+- TLContext created by `KernelRunner`: 0 cycles (greenlet mode accounts
+  cycles separately, in the spirit of ADR-0020 D3).
+
+### D7. Kernel registry (`triton_emu/registry.py`)
+
+A separate `_kernels: dict[str, Callable]` holds the kernel name → function
+mapping:
+
+- `register_kernel(name, fn)`: ValueError on duplicate registration.
+- `get_kernel(name)`: KeyError if not registered.
+- `clear_registry()`: tests only.
+
+`RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites
+`_kernels[kernel_name] = kernel_fn` idempotently on every call (last call
+wins). This is consistent with the launch behavior in ADR-0045 D8.
+
+PE_CPU looks up the kernel function in the registry by `KernelRef.name` and
+then runs it with KernelRunner.
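
The two registration paths in D7 behave differently on a name collision, which a short sketch makes concrete. The three function names follow the text; `launch_register` is a hypothetical stand-in for the overwrite that `RuntimeContext.launch` performs.

```python
from typing import Callable

_kernels: dict = {}


def register_kernel(name: str, fn: Callable) -> None:
    """Strict path: a duplicate name is an error."""
    if name in _kernels:
        raise ValueError(f"kernel {name!r} already registered")
    _kernels[name] = fn


def get_kernel(name: str) -> Callable:
    return _kernels[name]  # raises KeyError when the name is unknown


def clear_registry() -> None:
    _kernels.clear()


def launch_register(name: str, fn: Callable) -> None:
    """Launch path (hypothetical helper): overwrite, last call wins."""
    _kernels[name] = fn


def kernel_v1():
    return 1


def kernel_v2():
    return 2


register_kernel("gemm", kernel_v1)
launch_register("gemm", kernel_v2)  # no error: the launch path overwrites
```

After these two calls `get_kernel("gemm")` returns `kernel_v2`, whereas a second `register_kernel("gemm", ...)` would raise.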
+
+## Alternatives Considered
+
+### A1. Merge tl.* into ADR-0014 / ADR-0020
+
+Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption of
+PeCommands), and ADR-0020 covers 2-pass execution (Phase 1 timing / Phase 2
+data). `tl.*` is the API surface that kernel authors meet, so keeping it
+separate is better for searchability and onboarding.
+
+### A2. Deprecate command-list mode
+
+Rejected (for now). Command-list mode is lightweight for simple unit tests and
+kernel verification. It is kept as an entry point for inspecting PeCommand
+sequences without the greenlet dependency. When greenlet-only semantics (real
+data, Phase 2) are needed, they are entered explicitly through D2's mode
+selection.
+
+### A3. Remove TensorHandle arithmetic operators
+
+Rejected. They mimic the readability of real Triton kernel code (e.g.
+`c = a + b`), and the thread-local active-ctx pattern works cleanly. An
+explicit `tl.add(a, b)` is also exposed in D3.5, so a function-style call can
+replace an operator that seems confusing.
+
+### A4. Expand softmax into an explicit sequence (max → exp → sum → div)
+
+Partially adopted. `tl.softmax` accounts timing once as a single
+`MathCmd(op="softmax")` (D3.5). The real-data expansion is unrolled into the
+canonical sequence by the Phase 2 DataExecutor. That is, the time model is
+atomic and the data model is expanded; the two concerns are deliberately
+separated.
+
+## Consequences
+
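+- Every `tl.*` primitive a bench author meets is classified and defined in one
+  ADR. Paired with the host-side surface of ADR-0045 D8 (torch.empty etc.),
+  this completes the authoring guide for both "inside / outside the kernel".
+- The difference between command-list and greenlet modes is stated in D2, so
+  a new `tl.*` primitive is compatible with both as long as it follows the
+  `_emit()` pattern.
+- The thread-local active-ctx pattern (D4) is justified at ADR level, which
+  clarifies who is responsible for the reset when multiple PEs run in the
+  same thread (`_switch_kernel` restores the active ctx on cmd dispatch; this
+  is the contract of KernelRunner.run).
+- The diagnostic-only isolation of `tl.recv_no_consume` (D3.10) is fixed in
+  the ADR, making clear which layer prevents accidental use in production
+  kernels.
+- The registry (D7) is a separate D item, so kernel name collisions and
+  dynamic re-registration behavior are specified.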
@@ -0,0 +1,243 @@
+# ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies what `AhbmCCLBackend` + `DistributedContext` in
+`runtime_api/distributed.py`, i.e. the
+`torch.distributed.init_process_group(backend="ahbm")` entry point, actually
+install, and with what semantics they implement `all_reduce`/`barrier`/
+`get_rank` etc. ADR-0023 D11 mentions the "torch.distributed compatibility"
+intent, but **the behavior model of the backend itself** was missing at ADR
+level.
+
+## First action
+
+`RuntimeContext.__post_init__` automatically creates a `DistributedContext()`
+instance and attaches it to `self.distributed`. The first actions at that
+point are:
+
+1. Initialize `self._backend: AhbmCCLBackend | None = None` (not yet
+   initialized).
+2. Initialize the greenlet-local rank registry `self._rank_by_greenlet: dict = {}`
+   (ADR-0024 D2).
+3. The caller (RuntimeContext) plants a back-reference with
+   `dc._ctx_ref = self`, so that `init_process_group` can later reach
+   `ctx.engine` / `ctx.spec` / `ctx.launch`.
+
+In other words, **DistributedContext's first job is to attach itself to the
+RuntimeContext together with a back-reference and leave the backend slot
+empty**. The actual backend installation (IPCQ install, world_size
+resolution, algorithm module load) only happens when user code calls
+`torch.distributed.init_process_group(backend="ahbm")`.
+
+At that point, `init_process_group` first does the following:
+
+1. If `backend != "ahbm"`, raise `ValueError("Unsupported backend ...")` immediately.
+2. If `getattr(self, "_ctx_ref", None)` is None, raise
+   `RuntimeError("DistributedContext not bound to a RuntimeContext")`.
+3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)`: inside this constructor,
+   the ccl.yaml load, algorithm module import, world_size resolution, SFR
+   configuration, and IPCQ install all take place.
+4. `self._backend._dist_ctx = self`: lets the backend reach back into
+   `_rank_by_greenlet`.
+
+## Context
+
+The purpose of `AhbmCCLBackend` is to let PyTorch DDP collective calls
+(`init_process_group`, `all_reduce`, etc.) be used as-is, so that bench code
+looks the same as a "real DDP training script" (aligned with the launcher
+model of ADR-0024 + ADR-0027).
+
+What this backend is responsible for:
+
+- **Installing the IPCQ neighbor table once** at `init_process_group` time
+  (similar to real NCCL communicator creation).
+- On `all_reduce(tensor, op="sum")`, **launching the configured algorithm's
+  kernel function through `ctx.launch(...)`**.
+- Answering `get_world_size` / `get_rank` consistently from the greenlet-local
+  rank registry and ccl.yaml/topology.
+
+ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) cover parts of
+this, but **the scope of responsibility and the decision order of
+`AhbmCCLBackend` itself** are stated nowhere. This ADR fills that in.
+
+## Decision
+
+### D1. The backend is created only at `init_process_group(backend="ahbm")`
+
+`DistributedContext` starts with `_backend = None` at `__init__` time. The
+backend object does not exist until the user calls
+`dist.init_process_group(backend="ahbm")`; if any other API
+(`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) is called while
+the backend is None, it raises
+`RuntimeError("Default process group has not been initialized...")`
+(the `_ensure_initialized` helper).
+
+`backend != "ahbm"` raises `ValueError` immediately. Other backend names
+(nccl, gloo, etc.) are not recognized.
+
+### D2. world_size resolution priority: algorithm > defaults > topology
+
+The decision order of `AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):
+
+1. If the algorithm entry in `ccl.yaml` has `world_size`, use that value.
+2. If `defaults.world_size` exists, use that value.
+3. If neither exists, use `spec.system.sips.count` (= the number of SIPs in
+   the topology).
+
+The default meaning is **rank = SIP** (ADR-0024). Cube/PE-level parallelism is
+expressed with DPPolicy inside each rank and does not affect world_size. An
+explicit world_size override in `ccl.yaml` is honored as-is for the legacy
+"rank = flat PE index" test path.
+
+The user arguments of `init_process_group(world_size=..., rank=...)` are
+**accepted but ignored** (same meaning as the `RANK` / `WORLD_SIZE` env vars
+of real PyTorch).
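
The three-level priority in D2 reduces to a short fallback chain. The sketch below uses plain dicts for the algorithm entry and the `defaults` section; the function signature is an assumption for illustration and not the real `_resolve_world_size`.

```python
def resolve_world_size(algo_cfg: dict, defaults: dict, topology_sip_count: int) -> int:
    """Pick world_size: algorithm entry, then defaults, then topology SIP count."""
    if "world_size" in algo_cfg:
        return int(algo_cfg["world_size"])
    if "world_size" in defaults:
        return int(defaults["world_size"])
    return topology_sip_count
```

For example, with a 4-SIP topology an empty configuration yields 4, while an algorithm entry carrying `world_size: 16` wins over both `defaults` and the topology.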
+
+### D3. The five installation steps `init_process_group` performs immediately
+
+Inside `AhbmCCLBackend.__init__`, the following run in sequence:
+
+1. **ccl.yaml loading**: `kernbench.ccl.install.load_ccl_config()` →
+   `resolve_algorithm_config(_cfg_all)` yields the merged config of
+   `defaults.algorithm` (or the algorithm the user specified).
+2. **Algorithm module import**: `importlib.import_module(self._merged["module"])`.
+   The module must expose a `kernel` function, a
+   `kernel_args(world_size, n_elem, cube_w, cube_h)` helper, and an optional
+   `TOPO_NAME_TO_KIND` mapping.
+3. **world_size resolution** (D2).
+4. **Topology metadata collection**: from `spec`, read `n_sips`, `sip_topo`
+   (default `ring_1d`), `cube_w`/`cube_h`, and `sips.w`/`sips.h`. If the SIP
+   topology is not ring_1d, derive `_sip_topo_w/h` from the explicit `w`/`h`
+   or from a square root (guaranteeing `w*h == n_sips`). A mismatch raises
+   `ValueError`.
+5. **SFR + IPCQ install**: call
+   `kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine, spec, self._merged)`.
+   This function pushes the IPCQ neighbor table to pe0 of every SIP/cube
+   (corresponding to the one-time setup of a real NCCL communicator).
+
+If this order changes (e.g. the algorithm module load fails before SFR), a
+partially initialized state can result. D3 is therefore treated as one atomic
+five-step sequence: on failure the backend stays uninstalled.
+
+### D4. greenlet-local rank registration (ADR-0024 D2)
+
+`DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps each spawned
+worker greenlet to a rank. When the bench launcher (e.g.
+`torch.multiprocessing.spawn`) starts a worker, it registers it by calling
+`dc._bind_rank(g, rank)`.
+
+`get_rank()` looks up the greenlet returned by `getcurrent()`. An unregistered
+greenlet falls back to 0, which keeps single-driver / test compatibility.
+
+The backend obtains the current greenlet's rank during `all_reduce` through
+`_dist_ctx._rank_by_greenlet` (D5).
+
+### D5. Behavior of `all_reduce(tensor, op="sum")`
+
+Validation:
+
+- `op != "sum"` → `NotImplementedError`. Current kernels implement only add
+  reduction.
+- `tensor._handle is None` → `RuntimeError("not deployed")`.
+- Empty `tensor._handle.shards` → `RuntimeError("no shards")`.
+
+Preparation:
+
+- `n_elem = shards[0].nbytes // tensor.itemsize`: the element count of a
+  single shard.
+- `kernel_fn = self._algo_module.kernel`: the entry function of the algorithm
+  module imported in D3.
+- Determine the effective cube dims: if the first SIP has exactly one cube,
+  treat it as scalar (1,1); otherwise use the topology's `cube_w`/`cube_h`.
+  This naturally absorbs the case where TP uses only some of the cubes.
+- `kernel_args = self._algo_module.kernel_args(world_size, n_elem, cube_w,
+  cube_h)`: the algorithm decides the argument set passed to its own kernel.
+
+Dispatch:
+
+- Look up the current greenlet's rank with `_rank_by_greenlet.get(g, 0)`.
+- Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`.
+- `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor, *kernel_args,
+  *extra_args, _defer_wait=True)`: `_defer_wait=True` delegates the collective
+  drain to the main scheduler (ADR-0027 D0.4).
+
+Drain:
+
+- If a parent greenlet is alive (multi-greenlet mode), enqueue into
+  `_pending_collective_handles` and switch to the parent. The main scheduler
+  drains everything at once after all ranks have launched.
+- In single-driver mode, drain inline right away with
+  `for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
+
+### D6. `barrier()` is a no-op (single-driver model)
+
+kernbench handles all ranks as greenlets inside one Python process. No
+situation requires cross-process synchronization, so `barrier()` is callable
+but performs no synchronization at all. It is kept for API compatibility with
+real PyTorch DDP (so callers do not get NotImplementedError).
+
+If a multi-process kernbench is introduced in the future (e.g. an independent
+SimPy event loop per process), a new ADR superseding D6 will be required.
+
+### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`
+
+- `get_rank()` (D4): the bound rank of the current greenlet. 0 if unregistered.
+- `get_world_size()` (D2): the world_size the backend resolved in D3.
+- `get_backend()`: always the string `"ahbm"`. If the backend object does not
+  exist, `_ensure_initialized` raises RuntimeError.
+
+Difference from real PyTorch:
+
+- Real PyTorch's `get_rank()` is a process-global value, whereas in kernbench
+  it is greenlet-local. Called inside a spawned worker it returns that
+  worker's rank; called from the main thread it returns 0. Bench authors
+  should expect a meaningful rank only inside a worker function.
+
+### D8. Supported API surface (final)
+
+APIs exposed by `DistributedContext`:
+
+- `init_process_group(backend="ahbm", world_size=None, rank=None, **kwargs)`
+- `is_initialized() -> bool`
+- `get_world_size() -> int`
+- `get_rank() -> int`
+- `get_backend() -> str`
+- `all_reduce(tensor, op="sum") -> None`
+- `barrier() -> None`
+- (internal) `_bind_rank(g, rank)`
+
+Other PyTorch distributed APIs (broadcast, reduce, all_gather, gather,
+scatter, send/recv, etc.) are **not implemented yet**. At kernel level they
+can be expressed directly with `tl.send`/`tl.recv` (ADR-0046 D3.10), but they
+are not exposed on the dist.* surface. When an additional collective is
+needed, D8 is extended by adding a pair: a separate algorithm module plus a
+`DistributedContext` method.
+
+## Alternatives Considered
+
+### A1. Create the backend eagerly in `RuntimeContext.__init__`
+
+Rejected. If ccl.yaml is missing or the algorithm module cannot be imported,
+RuntimeContext creation itself would fail even for benches that never use
+distributed features. "Install only at call time" (D1) is the correct lazy
+semantics.
+
+### A2. Always derive world_size from the topology (no override)
+
+Rejected. The "explicit override" path of ADR-0024 D1 is used by legacy
+tests. It is kept for diagnostic scenarios that need PE-level ranks defined
+inside one SIP.
+
+### A3. Treat `op != "sum"` as a silent fallback
+
+Rejected. If the user intended `op="prod"` / `"max"` / `"avg"` and sum ran
+silently, verifying results would be very hard. An explicit
+`NotImplementedError` is safer.
+
+### A4. Implement `barrier` as a SimPy event
+
+Rejected (for now). In the single-driver model cross-process synchronization
+has no meaning, so a no-op is semantically accurate. A fake SimPy barrier
+would only add code complexity. To be re-evaluated when a multi-process
+kernbench is introduced.
+
+## Consequences
+
+- The five-step installation (D3) of
+  `torch.distributed.init_process_group(backend="ahbm")` is fixed at ADR
+  level, so it is clear where a future collective algorithm should hook in.
+- The priority in D2 (algorithm > defaults > topology) is explicit, so the
+  blast radius of a ccl.yaml change can be gauged quickly.
+- The barrier no-op decision in D6 is fixed at ADR level, making it clear
+  that a multi-process kernbench must supersede it with a separate ADR.
+- The list of unsupported APIs in D8 is explicit, giving a clear basis for
+  refusal when a user tries to call `dist.broadcast(...)`.
@@ -0,0 +1,262 @@
+# ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies the free-list algorithm, page alignment, and coalescing
+rules used by `_FreeList` / `PEMemAllocator` in `policy/address/allocator.py`
+and `VirtualAllocator` in `va_allocator.py`. ADR-0001 (PhysAddr layout) and
+ADR-0011 (PA/VA/LA model) define the address scheme, but the **allocation
+algorithm** had no ADR of its own.
+
+## First action
+
+### `_FreeList(capacity)`
+
+On construction it initializes `self._capacity = capacity`, `self._used = 0`,
+`self._free = [(0, capacity)]`. Its first job is to **set up the whole region
+as a single free block**: the one tuple `(offset=0, size=capacity)` is the
+only element of the free list.
+
+### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
+
+On construction it creates two `_FreeList` instances:
+
+- `self._hbm = _FreeList(cfg.hbm_slice_bytes)`: sized to the HBM slice this PE
+  owns (`hbm_bytes_per_cube // hbm_slices_per_cube`).
+- `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)`: sized to
+  `tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler
+  reservation is carved out in advance).
+
+So PEMemAllocator's first job is to **set up this PE's HBM slice and user TCM
+region each as a single free block**.
+
+### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
+
+On construction: `self._va_base = va_base`, `self._va_size = va_size`,
+`self._page_size = page_size`, `self._used = 0`, `self._free = [(va_base,
+va_size)]`. Its first job is to **set up the range from the VA base through
+its size as a single block and record page_size**.
+
+## Context
+
+`runtime_api/context.py::_ensure_allocators` builds the allocator set in the
+following steps:
+
+1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`, the SIP
+   range per target_device, etc. from the spec.
+2. Pack all parameters into a frozen `AddressConfig`.
+3. Create one `PEMemAllocator(sip, cube, pe, cfg)` instance for every
+   combination of target SIP range × cube × PE.
+4. Create one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64 GiB,
+   page_size=pe_mmu.page_size)`.
+
+Responsibilities of the allocators:
+
+- **PEMemAllocator**: PA-space allocation of the PE-local HBM slice / TCM
+  (including PhysAddr encoding).
+- **VirtualAllocator**: page-aligned allocation of the device-wide VA space.
+  `RuntimeContext._create_tensor` then pushes the VA → PA mapping to the
+  fabric as `MmuMapMsg`.
+
+These algorithms are:
+
+- simple **first-fit**.
+- The free block list is kept **sorted by start offset**.
+- On `free()`, a block is **coalesced with adjacent blocks on both sides**.
+
+The rationale for these decisions is recorded nowhere, so a reference is
+needed to answer future questions such as "why not best-fit", "why not a
+buddy allocator", or "why is a partial-overlap free silently allowed". This
+ADR provides that reference.
+
+## Decision
+
+### D1. `_FreeList`: offset-based first-fit + coalescing
+
+`policy/address/allocator.py::_FreeList`:
+
+- Internal representation: `list[tuple[int, int]]` = `[(start_offset, size), ...]`,
+  a list of free blocks sorted by start offset.
+- `alloc(nbytes)`:
+  1. Walk the free list from the front (first-fit).
+  2. Cut the front part out of the first block with `size >= nbytes`.
+  3. On an exact match remove the whole block; otherwise shrink it to
+     `(start+nbytes, size-nbytes)`.
+  4. `_used += nbytes`, return the cut `start`.
+  5. If no block fits, raise `AllocationError("overflow ... largest free block
+     ...")`.
+- `free(offset, nbytes)`:
+  1. `_used -= nbytes`.
+  2. Find the insert position with `bisect_left(self._free, (offset,))`.
+  3. If adjacent to the previous block (`prev_start + prev_size == offset`),
+     absorb it.
+  4. If adjacent to the next block (`offset+nbytes == next_start`), absorb it.
+  5. Insert the coalesced range at its sorted position.
+
+This algorithm is weaker against fragmentation than best-fit / buddy, but the
+design assumption is that it suffices for this simulator's workload (the
+deploy/free pattern is almost stack-like). If the workload changes, D1 is a
+candidate for being superseded.
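
The alloc/free steps listed in D1 can be sketched as a small self-contained class. It follows the steps above but is not the actual `_FreeList`: the class name, the `blocks()` accessor, and the exact error message are illustrative, and (as in D2) `free` trusts its arguments.

```python
from bisect import bisect_left


class AllocationError(Exception):
    pass


class FreeList:
    """First-fit allocator over [0, capacity) with coalescing on free."""

    def __init__(self, capacity):
        self._used = 0
        self._free = [(0, capacity)]  # (start, size), sorted by start

    def alloc(self, nbytes):
        for i, (start, size) in enumerate(self._free):
            if size >= nbytes:
                if size == nbytes:
                    del self._free[i]  # exact fit: drop the block
                else:
                    self._free[i] = (start + nbytes, size - nbytes)
                self._used += nbytes
                return start
        largest = max((s for _, s in self._free), default=0)
        raise AllocationError(
            f"overflow: need {nbytes}, largest free block {largest}")

    def free(self, offset, nbytes):
        self._used -= nbytes
        i = bisect_left(self._free, (offset,))
        start, size = offset, nbytes
        if i > 0:
            prev_start, prev_size = self._free[i - 1]
            if prev_start + prev_size == offset:  # merge with previous
                i -= 1
                self._free.pop(i)
                start, size = prev_start, prev_size + size
        if i < len(self._free) and self._free[i][0] == offset + nbytes:
            size += self._free.pop(i)[1]          # merge with next
        self._free.insert(i, (start, size))

    def blocks(self):
        return list(self._free)


fl = FreeList(100)
a, b, c = fl.alloc(10), fl.alloc(20), fl.alloc(30)
fl.free(b, 20)   # hole in the middle, no neighbor to merge with yet
fl.free(a, 10)   # merges with the hole that follows it
fl.free(c, 30)   # merges with both neighbors
```

After the three frees the list is back to a single `(0, 100)` block, which is the property that keeps a deploy → cleanup → deploy cycle deterministic.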
+
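+### D2. Partial-overlap frees are **not checked**
+
+`_FreeList.free(offset, nbytes)` trusts the caller to pass the exact
+(offset, nbytes). It does not verify:
+
+- that the range was actually allocated.
+- that the range does not overlap another allocated region.
+
+Reason: in the simulator context the caller always stores the return value of
+`alloc()` and passes it to `free()` unchanged; it is not external user input.
+Adding safety checks would cost O(N) on every free and affect simulation
+wall-clock time.
+
+If this trust model breaks (e.g. a code path is introduced where two tensors
+point to the same PA), it must be re-examined at ADR level immediately.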
+
+### D3. `PEMemAllocator`: separate HBM and TCM channels
+
+`PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList` instances:
+
+- `_hbm`: sized `cfg.hbm_slice_bytes`.
+- `_tcm`: sized `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
+  tcm_scheduler_reserved_bytes`).
+
+`alloc_hbm(nbytes) -> PhysAddr`:
+
+- Obtain the offset with `_hbm.alloc(nbytes)`.
+- Encode the PA with `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id,
+  pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`.
+- On failure, `AllocationError("HBM overflow ...")`.
+
+`free_hbm(pa, nbytes)`:
+
+- Recover the PE-local offset as `pa.hbm_offset - pe_id * cfg.hbm_slice_bytes`.
+- `_hbm.free(offset, nbytes)`.
+
+`alloc_tcm(nbytes) -> PhysAddr`: similarly, encoded with `PhysAddr.pe_tcm_addr`.
+
+`free_tcm(pa, nbytes)`: uses `pa.sub_offset` as-is (for TCM the PE-local
+offset is the sub_offset).
+
+The allocator is unaware of the scheduler-reserved TCM region
+(`cfg.tcm_scheduler_reserved_bytes`); it is already subtracted from the
+capacity of `_tcm`. This is consistent with the PE_SCHEDULER internal buffer
+reservation in ADR-0014.
+
+### D4. `VirtualAllocator`: page-aligned first-fit + coalescing
+
+`policy/address/va_allocator.py::VirtualAllocator`:
+
+- Internal representation: the same sorted `list[tuple[int, int]]` as
+  `_FreeList`. Initially `[(va_base, va_size)]`.
+- `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
+- `alloc(nbytes) -> int`:
+  1. `aligned = _align_up(nbytes)`.
+  2. Search first-fit for a block with `size >= aligned`.
+  3. Cut `aligned` bytes from the front of the block. Remove it on an exact
+     match.
+  4. `_used += aligned`. Return the block `start` (= the aligned VA).
+  5. On failure, `VaAllocationError`.
+- `free(va, nbytes)`: frees in units of `_align_up(nbytes)`, with the same
+  coalesce algorithm as `_FreeList`.
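
The rounding rule `_align_up` can be written with integer arithmetic only, which avoids floating-point `ceil` on large sizes. The standalone function below is a sketch of that formula, taking `page_size` as an explicit argument rather than reading it from the allocator.

```python
def align_up(nbytes: int, page_size: int) -> int:
    """Round nbytes up to the next multiple of page_size (integer ceil)."""
    return -(-nbytes // page_size) * page_size


TWO_MIB = 2 * 1024 * 1024
one_byte = align_up(1, TWO_MIB)            # a 1-byte request takes a full page
exact_page = align_up(TWO_MIB, TWO_MIB)    # already aligned: unchanged
just_over = align_up(TWO_MIB + 1, TWO_MIB)
```

Because both `alloc` and `free` apply the same rounding, a caller can pass the original unaligned `nbytes` to `free` and still return exactly what was taken.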
+
+The actual `page_size` has different defaults in two places:
+
+- The parameter default of `VirtualAllocator.__init__`: `2 MiB`. Tests that
+  call it directly get this value.
+- When `RuntimeContext._ensure_allocators` instantiates it:
+  `pe_mmu.attrs.get("page_size", 4096)`, i.e. the `pe_mmu.attrs.page_size`
+  value from `topology.yaml` if present, otherwise a 4 KiB fallback.
+
+Why the two defaults differ: the standalone VirtualAllocator default matches
+the PE_MMU stopgap default of ADR-0039 (2 MiB), which keeps direct tests
+natural, while the 4 KiB context fallback is a safe minimum page when the
+topology does not set one. The real usage path is always the latter (because
+`_ensure_allocators` does the instantiation), and when `topology.yaml`
+specifies `page_size`, that value flows consistently to both sides (MMU + VA
+allocator).
+
+If this agreement breaks (e.g. VirtualAllocator is instantiated with a
+page_size different from PE_MMU), MMU `map()` falls into the sub-page region
+mode (ADR-0039 D3).
+
+Default VA range: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64 GiB`.
+These values are hard-coded in `_ensure_allocators` and carry no direct
+meaning in the VA model of ADR-0011; they simply reserve a device-wide
+address space large enough not to collide with host code.
+
+### D5. Lifecycle of allocator instances
+
+- `RuntimeContext._ensure_allocators` is called lazily (at the first call of
+  `_create_tensor`).
+- Once created, the allocator dict (`self._allocators`) is reused for the
+  lifetime of the RuntimeContext. A second deploy in the same process does
+  not create new objects.
+- `RuntimeContext.cleanup()` calls `_free_tensor()` on every living tensor →
+  MMU unmap + `va_allocator.free` + `pemem_allocator.free_hbm` restore the
+  free lists to their original state. The next RuntimeContext builds its own
+  allocators from the initial state.
+
+Because allocator state is not shared between RuntimeContexts, consecutive
+runs of deploy → cleanup → deploy within a single process are deterministic.
+
+### D6. Allocator failures raise (no silent OOM)
+
+Both `_FreeList.alloc` and `VirtualAllocator.alloc` raise `AllocationError` /
+`VaAllocationError` when no sufficient free block exists. The message
+includes "required size + largest available block", so it is possible to
+tell fragmentation from a genuine OOM.
+
+A silent fallback (e.g. allocating only as much as the largest block) is
+strictly forbidden: if a partially allocated tensor enters the SimPy stage,
+routing and DMA see a wrong PA and simulation accuracy breaks.
+
+### D7. 1:1 correspondence between address spaces and allocators
+
+Physical address space separation is expressed by the PhysAddr sub-unit
+(ADR-0001 D2.3), and each sub-unit has its own allocator instance:
+
+- HBM slice → `PEMemAllocator._hbm`.
+- PE TCM → `PEMemAllocator._tcm`.
+- (currently unused) M_CPU local memory, CUBE SRAM → would need separate
+  allocators. The current implementation still handles them as IPCQ-only
+  slots (ADR-0023 D9.7) and does not share the PA space, so no separate
+  free-list exists.
+
+When a cube-level SRAM allocator becomes necessary, add a
+`_FreeList(cfg.sram_bytes_per_cube)` instance per cube
+(`cfg.sram_bytes_per_cube` is already defined in `AddressConfig`, so the data
+model is ready).
+
+## Alternatives Considered
+
+### A1. best-fit / buddy allocator
+
+Rejected (for now). The workload's alloc/free pattern is stack-like (deploy
+order = free order), so first-fit + coalescing keeps fragmentation under
+sufficient control. If long-running fragmentation is observed in LLM kernel
+sweeps, a separate ADR will replace it with buddy.
+
+### A2. Add partial-overlap free validation
+
+Rejected, because of D2's trust model + the O(N) check cost. Enabling it in a
+debug mode (e.g. a `KERNBENCH_DEBUG` env var) remains possible as follow-up
+work.
+
+### A3. A unified allocator for VA and PA
+
+Rejected. The VA space (64 GiB device-wide) and the PA space (~6 GiB per
+slice) differ in meaning. VA is the host kernel's view; PA is the device
+sub-unit's view. The allocators are kept separate to stay consistent with the
+spirit of the VA model in ADR-0011 (the MMU maps between the two).
+
+### A4. Multi-tier page_size support (large page + small page)
+
+Rejected (for now). A single page_size (currently 2 MiB) fits the tensor
+sizes of LLM kernels (several MiB to several GiB), and small mappings are
+absorbed by the sub-page region of ADR-0039 D3 when needed. Multi-tier pages
+require extending the MMU model itself, so they are a candidate for a
+separate ADR.
+
+## Consequences
+
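+- The allocator algorithm is fixed at ADR level (D1·D3·D4), so when a
+  fragmentation issue appears in a new simulation scenario it is clear that
+  "first-fit + coalesce is used here".
+- D2's trust model is explicit, so it can be recognized early that this ADR
+  must be superseded if a path that takes alloc/free directly from user input
+  is introduced.
+- D7's 1:1 mapping of allocators to sub-units is explicit, so it is clear
+  where to add a free-list when separate M_CPU/SRAM regions are needed.
+- D4 records that the `VirtualAllocator` page_size must match the PE_MMU
+  setting, so the interaction with the ADR-0039 stopgap behavior can be gauged
+  quickly when page_size changes in topology.yaml.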
@@ -0,0 +1,231 @@
+# ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies the traffic-pattern catalog exposed by `run_probe(...)` in
+`probes/probe.py`, the formula vs actual comparison, and the meaning of
+invariant checks such as monotonicity / D2H≥H2D. ADR-0010 (CLI surface)
+enumerates the `kernbench probe` subcommand, but **what the probe actually
+measures** and **which invariants it judges as PASS/FAIL** were missing at
+ADR level.
+
+## First action (제일 처음에 하는 일)
+
+`run_probe(topology_path, case_filter=None)` 의 첫 4가지 작업:
+
+1. `Path(topology_path).expanduser().resolve()` 로 절대 경로 산출.
+2. `load_topology(path)` → `TopologyGraph` 인스턴스 (그래프 + spec).
+3. `_build_edge_map(graph)` → `{(src, dst): Edge}` 빠른 lookup 테이블.
+4. `AddressResolver(graph)` + `PathRouter(graph)` 인스턴스화.
+
+그 다음 `nbytes = 32768` (= 32 KiB, summary table 의 기준 데이터 크기) 와
+`show_all = (case_filter is None or case_filter == "all")` 를 설정.
+
+즉, **probe 의 첫 일은 "토폴로지를 한 번 로드하여 edge map / resolver /
+router 를 준비하고, 32 KiB 라는 표준 측정 크기를 픽스하는 것"**. 그 이후
+H2D → D2H → PE DMA 세 카테고리의 case 들이 각각 별도의 `GraphEngine`
+인스턴스에서 실행된다 (case 간 cross-talk 차단).
+
+## Context
+
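+`kernbench probe` is a verification tool introduced for the following
+purposes:
+
+- **Ground truth for manual analysis**: when the latency from a real
+  simulation (`kernbench run --bench ...`) looks abnormal, obtain the
+  expected answer for a simple traffic pattern separately and compare.
+- **Formula vs actual comparison**: check whether the analytical model
+  (wire latency + overhead + drain) matches the simulation result
+  (`total_ns`). A mismatch hints at which model-simplification
+  assumption (ADR-0033) is missing something.
+- **Monotonicity check**: automatically verify the invariant that
+  latency must increase monotonically with hop count.
+- **Utilization sweep**: a table of BW utilization by data size
+  (4 KiB to 1 MiB).
+
+Without an ADR-level specification of this tool's behavior:
+
+- Anyone adding a different kind of traffic pattern (e.g. MCpuDma, IPCQ)
+  will struggle to follow the existing categories' table format and
+  measurement units consistently.
+- It is ambiguous what monotonicity is checked against (hop count? cube
+  distance? wire length?).
+- The meaning of the 32 KiB reference size and the
+  `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]` sweep can only be found by
+  grepping the code.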
+
+## Decision
+
+### D1. Three case categories: H2D / D2H / PE DMA
+
+Each category has a distinct data path in the topology and is printed
+as its own summary table + sweep table + route detail block.
+
+- **H2D (Host→Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
+  dst_pe=0, pattern="zero")` flows along the
+  `pcie_ep → io_cpu → m_cpu → hbm_ctrl` path. The hop count grows with
+  the cube index:
+  - h2d-1hop: cube=0, hops=1
+  - h2d-2hop: cube=4, hops=2
+  - h2d-3hop: cube=8, hops=3
+  - h2d-4hop: cube=12, hops=4
+- **D2H (Device→Host Read)**: `MemoryReadMsg(src_sip=0, src_cube, src_pe=0)`.
+  Latency is the sum of the forward command path and the reverse data
+  path. Same four hop cases as H2D.
+- **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe, dst_pa)`.
+  Five cases that vary the cube/PE position:
+  - pe-local-hbm: same cube, same PE
+  - pe-same-half-hbm: same cube, different PE (PE 1)
+  - pe-cross-half-hbm: same cube, far PE (PE 4)
+  - pe-cross-cube-hbm-best: adjacent cube (cube 1)
+  - pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
+
+The indices 4/8/12 (H2D cubes) and 1/4/15 (PE DMA: PE 1, PE 4, cube 1,
+cube 15) are meaningful because of how distance is defined on the 4x4
+cube mesh (sip.cube_mesh.w=4, h=4). If the cube_mesh size changes later,
+these values must be updated with it.
+
+### D2. Standard measurement size: `nbytes = 32768` (32 KiB)
+
+The summary table for every case shows the result of a single run with
+`nbytes=32768`. Why 32 KiB:
+
+- A moderate size at which neither DMA overhead nor BW drain dominates.
+- Comparable to the single-transfer unit of many sub-units (TCM,
+  register file).
+
+A separate sweep table shows how utilization changes with size (D3).
+
+### D3. Utilization sweep — `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
+
+`SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`, `SWEEP_LABELS =
+["4KB", "16KB", "64KB", "256KB", "1MB"]`. For each size the following
+formulas apply:
+
+```
+drain   = nbytes / bottleneck_bw
+total   = overhead + wire + drain
+eff_bw  = nbytes / total
+util%   = eff_bw / bottleneck_bw × 100
+```
+
+If `bn_bw` is `None` or `<= 0`, the column prints 0.0 %. The point of
+the table is to show in one place how, as the hop count grows, small
+transfers become overhead-bound while large transfers become
+drain-bound.
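
The sweep formulas above can be sketched as follows. This is only an illustration: the function names `utilization_pct` and `sweep_row` are hypothetical, and the sketch assumes the bottleneck bandwidth is given in GB/s with 1 GB = 1e9 bytes, so that it is numerically equal to bytes per nanosecond.

```python
SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]
SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]


def utilization_pct(nbytes, overhead_ns, wire_ns, bn_bw):
    """Util% for one transfer size; bn_bw is in bytes/ns (assumed == GB/s)."""
    if bn_bw is None or bn_bw <= 0:
        return 0.0                      # no usable bottleneck BW: print 0.0 %
    drain = nbytes / bn_bw              # ns needed to drain the payload
    total = overhead_ns + wire_ns + drain
    eff_bw = nbytes / total             # effective bytes/ns
    return eff_bw / bn_bw * 100.0


def sweep_row(overhead_ns, wire_ns, bn_bw):
    """One sweep-table row: label -> Util% for every sweep size."""
    return {
        label: utilization_pct(size, overhead_ns, wire_ns, bn_bw)
        for label, size in zip(SWEEP_LABELS, SWEEP_SIZES)
    }
```

For a fixed overhead, the row values rise with size, which is the overhead-bound to drain-bound transition the table is meant to expose.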
+
+### D4. Measured items: actual / formula / breakdown
+
+Columns shown in each case row:
+
+- `Actual` (total_ns): `trace["total_ns"]` from the SimPy run.
+- `Ovhd`: sum of `node.attrs["overhead_ns"]` over every node on the
+  path (formula breakdown).
+- `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
+- `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
+- `Ovhd%` / `Drain%`: the share of Actual taken by Ovhd / Drain (the
+  formula's Wire is usually very small and is not shown).
+- `Eff.BW`: `nbytes / total_ns` (measured BW).
+- `BN.BW`: bottleneck bandwidth (formula), the minimum BW over all edges
+  on the path. Shown as "-" when no edge has a BW.
+- `Util%`: `Eff.BW / BN.BW × 100`. 100% means the single-stream BW upper
+  bound has been reached.
+
+A large gap between the formula sum (`wire + ovhd + drain`) and actual
+signals a factor that the model simplification does not capture (a cue
+to revisit the assumptions of ADR-0033).
+
+### D5. Automatic invariant checks: PASS/FAIL
+
+The following invariants are checked automatically and printed as
+`[v] PASS` / `[x] FAIL`:
+
+- **H2D / D2H monotonic increase**: actual latency must increase
+  strictly as the hop count grows. `all(lats[i] < lats[i+1] for ...)`.
+- **D2H ≥ H2D**: at the same hop index, D2H ≥ H2D (because D2H has two
+  legs, forward command + reverse data).
+  `all(d2h[i].total >= h2d[i].total)`.
+- **PE DMA best < worst**: cross-cube best (adjacent) latency <
+  cross-cube worst (diagonal) latency.
+- **PE DMA local vs remote**: prints a comparison of local BN BW and
+  remote BN BW (informational, not PASS/FAIL).
+
+A FAIL is printed clearly on a single line so that a person notices a
+model/topology regression immediately.
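
The first two invariants above could be expressed roughly as below. This is a sketch under assumptions: the helper names are hypothetical, latencies are plain floats ordered by hop index, and the text after the `[v]` / `[x]` prefix is illustrative rather than the probe's actual wording.

```python
def is_strictly_increasing(lats):
    """Monotonicity: every latency is smaller than the next one."""
    return all(a < b for a, b in zip(lats, lats[1:]))


def d2h_not_faster(h2d, d2h):
    """D2H >= H2D at every shared hop index."""
    return all(d >= h for h, d in zip(h2d, d2h))


def verdict(ok, label):
    """Format one result line with the stable PASS/FAIL prefix."""
    return f"{'[v] PASS' if ok else '[x] FAIL'} {label}"
```

With a single data point both checks are vacuously true, which is one way to read the "naturally skipped" behavior described for filtered runs.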
+
+### D6. Route detail: per-hop timestamp trace
+
+After the summary and sweep tables, each case's path and per-hop
+cumulative time (`_hop_timestamps`) are printed in a separate section:
+
+- H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
+  (`m_cpu → hbm_ctrl`) + per-hop trace.
+- D2H: forward (cmd, no data) and reverse (data) traces shown
+  separately.
+- PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
+
+Each hop's timestamp is the cumulative sum of `wire_ns + overhead_ns`.
+The terminal hop's annotation carries `drain:Xns`. The bottleneck edge
+is marked `<BN:XXGB/s>` so it can be spotted visually.
+
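+### D7. Meaning of the case_filter argument
+
+- `None` or `"all"`: run every case (default).
+- Any other string: run only the case whose name matches exactly, e.g.
+  `kernbench probe --case h2d-2hop`.
+
+Within each category a case is skipped when `name != case_filter`, and
+that category's monotonicity / D2H≥H2D comparison is naturally skipped
+when only one data point exists.
+
+The CLI parser's `--case` default is `"all"`, so omitting the argument
+runs everything.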
+
+### D8. A fresh GraphEngine per case
+
+The 4 H2D, 4 D2H, and 5 PE DMA cases each run in a **new GraphEngine**
+instance (`engine = GraphEngine(graph)`). Reasons:
+
+- Isolate accumulated state (op_log, completion tracking, allocator,
+  etc.) so it does not cross-talk between cases.
+- Guarantee that one case's traffic does not affect another case's BW
+  measurement.
+
+This isolation lets probe results be read as the latency of **each case
+as a lone single flow**. Measuring multi-flow contention is the job of
+other tools (e.g. the `pe2pe_overview` plot, the multi-flow merging
+model of ADR-0033).
+
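+### D9. Stability of the output format
+
+The probe's stdout is meant for humans; exact column widths,
+separators, and whitespace are not a machine-readable contract. For
+automated tools to parse probe results, a separate JSON output mode
+would have to be added (not implemented yet).
+
+The `[v]` / `[x]` prefixes on PASS/FAIL lines are guaranteed stable as
+anchors for CI grep.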
+
+## Alternatives Considered
+
+### A1. Register the probe as a separate bench (`@bench(name="probe")`)
+
+Rejected. The probe is intended as a verification tool, not a bench.
+Its essence is multi-engine execution for sweeps/analysis plus
+invariant PASS/FAIL output, which does not fit the "single device +
+single RuntimeContext" bench model of ADR-0045.
+
+### A2. Exit code 1 on a monotonicity violation
+
+Rejected (for now). The tool is aimed at human inspection, so it prints
+the PASS/FAIL lines and exits 0. If CI should fail on a violation, a
+separate wrapper can decide based on `grep "\[x\]"`. A strict-mode flag
+(`--strict`) can be introduced later.
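
A wrapper of the kind A2 describes might look like the sketch below. The function names are hypothetical, and the only thing it relies on is the `[x]` line prefix that D9 declares stable; everything after the prefix is treated as free text.

```python
def probe_has_failures(stdout: str) -> bool:
    """Return True if any output line begins with the `[x]` FAIL anchor."""
    return any(line.lstrip().startswith("[x]") for line in stdout.splitlines())


def exit_code_for(stdout: str) -> int:
    """Map probe output to a CI exit code: 1 on any FAIL line, else 0."""
    return 1 if probe_has_failures(stdout) else 0
```

A CI script could capture the output of `kernbench probe` (for instance with `subprocess.run(..., capture_output=True, text=True)`) and pass `exit_code_for(result.stdout)` to `sys.exit`.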
+
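+### A3. Move the probe's case definitions to external YAML
+
+Rejected (for now). The 13 cases (4 H2D + 4 D2H + 5 PE DMA) are
+hardcoded and their meaning is tightly bound to the topology's mesh
+structure. Moving them to external YAML would require documenting the
+meaning of the indices (4, 8, 12 / 1, 4, 15) separately, which costs
+cohesion. If cases start being added frequently, introduce it then via
+a separate ADR.
+
+### A4. Add multi-flow contention measurement
+
+Rejected (out of probe scope). The single-flow isolation model stated
+in D8 is the probe's core intent. Multi-flow contention belongs to a
+different area of the ADR-0033 latency model and is handled by a
+separate tool or a separate case category.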
+
+## Consequences
+
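+- The probe's case catalog (D1) and measurement units (D2/D3) are now
+  stated at the ADR level, so it is clear which table format a new
+  traffic category must follow.
+- The column semantics of formula vs actual (D4) are fixed, so a
+  question like "why is Drain% 5% / 70%?" can quickly be tied to a check
+  of ADR-0033's assumptions.
+- The automatic invariant checks (D5) are fixed in the ADR, establishing
+  the safety net that the probe immediately catches monotonicity /
+  D2H≥H2D regressions when the latency model changes.
+- The per-case isolation of D8 is explicit, so probe results can safely
+  be read as single-flow measurements. It is also clear that multi-flow
+  measurement would need a separate tool track.
+- The strict-mode flag from A2 is recorded as a follow-up candidate, so
+  it can be added with minimal work if CI integration is requested.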
@@ -0,0 +1,308 @@
+# ADR-0050: CCL Algorithm Module Contract — `ccl/algorithms/*.py`
+
+## Status
+
+Accepted (2026-05-22).
+
+Specifies the interface, kernel signature, and procedure for adding a
+new algorithm that a module under `src/kernbench/ccl/algorithms/` must
+satisfy to be used as a collective algorithm by the AHBM CCL backend
+(ADR-0047). ADR-0047 D3 mentions in a single line that "an algorithm
+module must expose `kernel`, `kernel_args`, and optionally
+`TOPO_NAME_TO_KIND`", but **the contract that algorithm module authors
+must follow** has never been written down at the ADR level. This ADR is
+the counterpart of ADR-0045, which covers the bench module contract.
+
+## First action
+
+An algorithm module is imported at two points:
+
+1. **AHBM backend entry**: when user code calls
+   `dist.init_process_group(backend="ahbm")`, `AhbmCCLBackend.__init__`
+   runs `self._algo_module =
+   importlib.import_module(self._merged["module"])`. The first things
+   that happen at module level are:
+   - Integer constants such as `SIP_TOPO_RING/TORUS/MESH` appear in the
+     module namespace.
+   - The `TOPO_NAME_TO_KIND` dict appears in the module namespace; the
+     backend looks it up with
+     `topo_map = getattr(self._algo_module, "TOPO_NAME_TO_KIND", None)`.
+   - The `kernel_args` function is defined, to be used by the caller at
+     call time.
+   - Algorithm functions such as `allreduce_intercube_multidevice` are
+     defined.
+   - The last line of the module exposes the alias
+     `kernel = allreduce_intercube_multidevice`.
+
+2. **ccl.yaml install step**: the same algorithm module is imported when
+   `kernbench.ccl.install.install_ipcq` is called to push the IPCQ
+   neighbor table.
+
+In short, **the algorithm module's first job is to expose the
+topology-kind constants, the `TOPO_NAME_TO_KIND` dict, the `kernel_args`
+function, and the `kernel` alias in the module namespace**. Import-time
+side effects suffice for all of this; no separate initialization call is
+needed.
+
+## Context
+
+At process-group initialization, `AhbmCCLBackend` (ADR-0047) obtains a
+module path from `defaults.algorithm` in `ccl.yaml` (or from an
+algorithm name given by the user) and imports it dynamically. The
+backend expects four things from that module:
+
+- `kernel`: the entry function of the collective.
+- `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: the
+  bundle of positional arguments passed to the kernel.
+- `TOPO_NAME_TO_KIND` (optional): a dict mapping the `sips.topology`
+  string in `topology.yaml` (e.g. `"ring_1d"`, `"torus_2d"`,
+  `"mesh_2d_no_wrap"`) to an integer constant.
+- (indirect) IPCQ neighbor table installation:
+  `configure_sfr_intercube_multisip` decides the SFR from the algorithm
+  module's `TOPO_NAME_TO_KIND` and `cube_w/h`.
+
+The only algorithm module in the current corpus is
+`lrab_hierarchical_allreduce.py` (248 lines). The name stands for
+"**l**eft-**r**ight **a**lternating **b**roadcast **hierarchical
+allreduce**". Every module added later, such as `ring_allreduce`,
+`tree_allreduce`, or `broadcast`, must follow this contract for dispatch
+to stay consistent.
+
+Without this contract at the ADR level:
+
+- A new algorithm author has to infer the signature from the single
+  line in ADR-0047 D3.
+- The meaning of the kernel's argument order (in particular `t_ptr,
+  n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w,
+  sip_topo_h, tl`) is unclear without grepping the code.
+- What arguments `kernel_args` takes and what tuple it must return is
+  settled only by convention.
+
+## Decision
+
+### D1. An algorithm module exposes four public symbols
+
+```python
+# src/kernbench/ccl/algorithms/<name>.py
+
+from __future__ import annotations
+
+# (required) topology-kind constants, used inside the algorithm
+SIP_TOPO_RING = 0
+SIP_TOPO_TORUS = 1
+SIP_TOPO_MESH = 2
+
+# (optional) topology name → kind mapping. The backend uses it to
+# convert the string SIP topology from ccl.yaml/topology into an integer.
+TOPO_NAME_TO_KIND = {
+    "ring_1d": SIP_TOPO_RING,
+    "torus_2d": SIP_TOPO_TORUS,
+    "mesh_2d_no_wrap": SIP_TOPO_MESH,
+}
+
+# (required) kernel argument builder
+def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
+    return (n_elem, cube_w, cube_h, world_size)
+
+# (required) kernel function (TLContext is injected via the `tl=...` keyword)
+def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
+                         sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
+    ...
+
+# (required) kernel alias; the backend accesses it as `module.kernel`
+kernel = my_allreduce_kernel
+```
+
+- The `kernel` alias is the entry point the backend calls directly.
+  Whatever the function is named (e.g.
+  `allreduce_intercube_multidevice`), it must be exposed as
+  `module.kernel = fn`.
+- Without `kernel_args` the backend has no way to build the algorithm's
+  arguments. See D2 for the signature.
+- Without `TOPO_NAME_TO_KIND` the backend falls back to
+  `sip_topo_kind = 0`. An algorithm that supports only one topology may
+  omit it.
+
+### D2. `kernel_args` signature: `(world_size, n_elem, *, cube_w, cube_h)`
+
+```python
+def kernel_args(world_size: int, n_elem: int, *,
+                cube_w: int = 4, cube_h: int = 4) -> tuple:
+    return (n_elem, cube_w, cube_h, world_size)
+```
+
+- **Positional arguments**: `world_size` (= number of ranks), `n_elem`
+  (= element count of a single shard, in f16).
+- **Keyword arguments**: `cube_w`, `cube_h` (= cube mesh size). The
+  default is 4×4, matching the `sip.cube_mesh` default in
+  `topology.yaml`.
+- **Returns**: a tuple bundled in the order of the kernel's positional
+  arguments.
+
+When the backend's `all_reduce` makes the call:
+
+```python
+kernel_args_tuple = self._algo_module.kernel_args(
+    self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
+)
+extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
+pending = self.ctx.launch(
+    self._merged["algorithm"], kernel_fn, tensor,
+    *kernel_args_tuple, *extra_args, _defer_wait=True,
+)
+```
+
+So the kernel's final positional arguments are `(tensor_ptr,
+*kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`,
+with `tl=...` injected automatically as a keyword. The length and order
+of the tuple returned by `kernel_args` must match the **kernel
+signature 1:1**.
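
One way to check that 1:1 match without running a kernel is to bind the arguments against the signature. The sketch below uses the standard `inspect.signature(...).bind(...)`; the helper `bind_kernel_call` is hypothetical and not part of the backend, and the kernel here is an empty stand-in.

```python
import inspect


def kernel_args(world_size, n_elem, *, cube_w=4, cube_h=4):
    return (n_elem, cube_w, cube_h, world_size)


def my_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
              sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    return None  # stand-in body; only the signature matters here


def bind_kernel_call(kernel, args_tuple, tensor_ptr, extra_args, tl):
    """Bind arguments in the order the text describes, without calling."""
    sig = inspect.signature(kernel)
    # bind() raises TypeError when the tuple length does not fit the signature.
    return sig.bind(tensor_ptr, *args_tuple, *extra_args, tl=tl).arguments


bound = bind_kernel_call(my_kernel, kernel_args(4, 8), 0x1000, (2, 0, 0, 0), tl=object())
```

Here `bound["n_sips"]` ends up holding the `world_size` passed to `kernel_args`, which makes the positional mapping easy to eyeball in a unit test.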
+
+### D3. `kernel` function signature: standardized 9 + tl arguments
+
+Recommended signature:
+
+```python
+def my_kernel(
+    t_ptr: int,             # VA base of the row-wise-sharded tensor on this SIP
+    n_elem: int,            # element count per cube tile (or per shard)
+    cube_w: int,            # cube mesh width  (from kernel_args)
+    cube_h: int,            # cube mesh height (from kernel_args)
+    n_sips: int,            # same as world_size (rank = SIP, ADR-0024)
+    sip_rank: int,          # rank of this SIP
+    sip_topo_kind: int,     # result of the TOPO_NAME_TO_KIND lookup
+    sip_topo_w: int,        # SIP mesh width (0 for ring_1d)
+    sip_topo_h: int,        # SIP mesh height (0 for ring_1d)
+    *, tl,                  # TLContext (auto-injected)
+) -> None:
+    ...                     # kernel body (see D4)
+```
+
+Even if `kernel_args` adopts a different positional order, the kernel's
+**last four positional arguments are always `(sip_rank, sip_topo_kind,
+sip_topo_w, sip_topo_h)`**, appended by the backend as `extra_args`
+(ADR-0047 D5). User-defined algorithms must accept these four arguments
+too; a single-SIP algorithm can simply ignore them.
+
+`tl` is injected as a keyword, not positionally: `RuntimeContext.launch`
+adds `tl=tl_ctx` right before calling the kernel. The `tl` in the kernel
+signature must therefore be keyword-only (`*, tl`) or the last keyword
+parameter.
+
+### D4. Freedom and constraints of the kernel body
+
+Surface available inside the kernel body: every `tl.*` primitive of
+ADR-0046 D3.
+
+Commonly used patterns:
+
+- `cube_id = tl.program_id(axis=1)`: index of the cube this PE belongs
+  to.
+- `pe_addr = t_ptr + cube_id * nbytes`: VA of the per-cube tile.
+- `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")`: load local
+  data.
+- `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)`: IPCQ
+  collective.
+- `acc = acc + recv`: TensorHandle arithmetic operators (ADR-0046 D4).
+- `tl.store(pe_addr, acc)`: store the result.
+
+The kernel body is an ordinary Python function; branching and looping
+are unrestricted. However:
+
+- No SimPy `yield` or `async` (ADR-0046 D1).
+- No direct access to a TensorHandle's `.data`: the phase 1 timing model
+  knows nothing about data dependencies (the 2-pass split of ADR-0020).
+- Kernel execution must be deterministic: running twice with the same
+  input must emit the same op sequence. No randomness or external IO.
+
+### D5. SIP topology semantics: the meaning of `sip_topo_kind`
+
+The backend looks up the `system.sips.topology` string from
+`topology.yaml` in the algorithm module's `TOPO_NAME_TO_KIND` and
+converts it to the integer `sip_topo_kind`. The algorithm branches on
+this integer:
+
+```python
+if sip_topo_kind == SIP_TOPO_RING:
+    acc = _inter_sip_ring(...)
+elif sip_topo_kind == SIP_TOPO_TORUS:
+    acc = _inter_sip_torus_2d(...)
+elif sip_topo_kind == SIP_TOPO_MESH:
+    acc = _inter_sip_mesh_2d(...)
+```
+
+Each topology branch talks to its peers through IPCQ direction names
+(e.g. `"global_E"`, `"W"`, `"S"`, `"N"`). ADR-0023/0025 define what the
+directions mean, and `configure_sfr_intercube_multisip` installs the
+IPCQ neighbor table accordingly.
+
+When an algorithm module receives a topology kind it does not support,
+it should explicitly
+`raise ValueError(f"unsupported topology kind {sip_topo_kind}")`
+rather than silently doing nothing, so that a mistaken dispatch from
+the backend fails fast.
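
The fail-loud recommendation amounts to adding a final branch to the dispatch. A minimal sketch, in which the function `select_inter_sip_phase` and its string return values are hypothetical placeholders for the real inter-SIP phases:

```python
SIP_TOPO_RING = 0
SIP_TOPO_TORUS = 1
SIP_TOPO_MESH = 2


def select_inter_sip_phase(sip_topo_kind: int) -> str:
    """Pick the inter-SIP phase for a topology kind, failing on unknown kinds."""
    if sip_topo_kind == SIP_TOPO_RING:
        return "ring"
    if sip_topo_kind == SIP_TOPO_TORUS:
        return "torus_2d"
    if sip_topo_kind == SIP_TOPO_MESH:
        return "mesh_2d"
    # No silent no-op: an unknown kind is a dispatch error.
    raise ValueError(f"unsupported topology kind {sip_topo_kind}")
```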
+
+### D6. Structure of the algorithm entry in ccl.yaml
+
+An algorithm module is paired with an entry in `ccl.yaml`
+(ADR-0023 D10 and ADR-0047 D3):
+
+```yaml
+defaults:
+  algorithm: lrab_hierarchical_allreduce
+  n_elem: 8
+
+algorithms:
+  lrab_hierarchical_allreduce:
+    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
+    # optional: world_size override
+    # optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
+```
+
+- `module`: full Python module path. The backend's
+  `importlib.import_module` uses this string as-is.
+- `world_size` (optional): when present, overrides the topology fallback
+  (ADR-0047 D2).
+- Algorithm-specific parameters are consumed by
+  `configure_sfr_intercube_multisip`.
+
+To add a new algorithm:
+
+1. Write `src/kernbench/ccl/algorithms/<name>.py` (D1 convention).
+2. Add an entry to the `algorithms` section of `ccl.yaml`.
+3. (If needed) add an SFR installation branch to
+   `kernbench.ccl.sfr_config`.
+4. Add tests (e.g. `tests/sccl/test_<name>.py`, extending the eval
+   harness of ADR-0043).
+
+### D7. Legacy "rank = flat PE index" mode
+
+The `world_size` override path in `ccl.yaml` described in ADR-0047 D2 is
+used by the legacy "rank = flat PE index" tests. In this mode too, the
+algorithm module can assume that `n_sips=world_size` ranks come in: the
+backend separates the rank↔(SIP, cube, PE) mapping beforehand, so the
+algorithm body needs no per-mode branching.
+
+For single-cube workloads, however, `cube_w=cube_h=1` comes in, and the
+module must be written so that the mesh-based phases are skipped (see
+the `single_cube = (cube_w == 1 and cube_h == 1)` pattern in
+`lrab_hierarchical_allreduce.py`).
+
+## Alternatives Considered
+
+### A1. Structure the algorithm module as a class (`class Allreduce: kernel(...)` etc.)
+
+Rejected. The Python module namespace itself already serves as the
+algorithm's identity (the `importlib.import_module` of ADR-0047 D3); a
+class layer would only add indirection and thicken the dispatch code.
+The pattern of a module-level free function plus a `kernel` alias is
+clear enough.
+
+### A2. Type `kernel_args` as an explicit dataclass
+
+Rejected (for now). It is normal for the argument count to differ per
+algorithm, and forcing a single dataclass would make it hard for diverse
+algorithms to coexist. Returning a tuple is simple and meshes cleanly
+with the backend's `*kernel_args_tuple` unpacking. If an algorithm needs
+stronger typing of its own, it is free to use a NamedTuple inside its
+module.
+
+### A3. Move SFR installation into the algorithm module
+
+Rejected. SFR installation (`configure_sfr_intercube_multisip`) is a
+cross-module decision that looks at both topology and algorithm to
+install the IPCQ neighbor table, so a dedicated place such as
+`kernbench.ccl.sfr_config` is a more natural home than the algorithm
+module. The "add an sfr_config branch if needed" workflow of D6 gives a
+clearer separation of responsibilities.
+
+### A4. Auto-register the algorithm name in the module namespace (decorator)
+
+Rejected. Unlike ADR-0045 (bench), an algorithm is tied directly to its
+ccl.yaml entry, so an additional registry would be redundant. The single
+`module` string mapping is enough.
+
+## Consequences
+
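+- The one-line contract of ADR-0047 D3 is expanded into the
+  author-friendly guide of D1–D7, so adding a new algorithm no longer
+  requires inferring the signature by grep.
+- The 9 + tl argument signature of D3 is standardized and meshes
+  naturally with the backend's `extra_args` append (ADR-0047 D5). It is
+  now explicit that future single-SIP-only algorithms must also accept
+  the four sip_* arguments.
+- With the fail-loud recommendation of D5, if the topology in ccl.yaml
+  is set to one the algorithm does not support, the backend fails with a
+  ValueError instead of a silent wrong result.
+- The step-by-step procedure of D6 makes clear how far a new algorithm
+  reaches into sfr_config / tests / ccl.yaml.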
@@ -0,0 +1,267 @@
+# ADR-0051: Routing Helper API — `AddressResolver` + `PathRouter`
+
+## Status
+
+Accepted (2026-05-22).
+
+Specifies every public API, argument, and return value of the two helper
+classes exposed by `policy/routing/router.py` (`AddressResolver`,
+`PathRouter`), along with where each of the four adjacency graphs is
+used. ADR-0002 defines routing distance, ordering, and bypass rules, but
+**the helper API surface itself** has never been documented at the ADR
+level.
+
+## First action
+
+### `AddressResolver(graph)`
+
+On construction it caches two things:
+
+1. `self._node_ids = set(graph.nodes)`: the set of all node ids (for
+   lookup).
+2. `self._hbm_slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube`:
+   computed from `graph.spec.cube.memory_map` (default
+   `48 GiB / 8 slices = 6 GiB`). `resolve()` uses this value to recover
+   `pe_id` from the `hbm_offset` of an HBM PA.
+
+In short, **the AddressResolver's first job is to precompute the set of
+all node ids and the HBM slice size**. It does not hold on to the graph
+itself.
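
The slice-size arithmetic and the `pe_id` recovery can be worked through in a few lines. The function names below are hypothetical; the defaults (48 and 8) are the ones quoted above.

```python
def hbm_slice_bytes(hbm_total_gb: int = 48, slices_per_cube: int = 8) -> int:
    """Bytes per HBM slice, using the same expression as the resolver."""
    return hbm_total_gb * (1 << 30) // slices_per_cube


def pe_id_from_offset(hbm_offset: int, slice_bytes: int) -> int:
    """Recover the owning PE index from an offset inside the cube's HBM."""
    return hbm_offset // slice_bytes


SLICE = hbm_slice_bytes()  # 6 * 2**30 bytes with the defaults
```

Under these defaults, any offset in the first 6 GiB maps to PE 0, the next 6 GiB to PE 1, and so on up to PE 7.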
+
+### `PathRouter(graph)`
+
+On construction it **builds four separate adjacency graphs at once**:
+
+1. `self._adj_all`: includes every edge (for component-to-component
+   routing).
+2. `self._adj`: only edges with `kind != "command"` (PE DMA / general
+   data path).
+3. `self._adj_mcpu_dma`: excludes `_MCPU_DMA_EXCLUDE = {"pe_internal",
+   "pe_to_router"}` (so that M_CPU DMA is not misrouted through PE
+   pipeline nodes).
+4. `self._adj_local`: excludes the eight `_UCIE_KINDS` (for cube-local
+   routing; stops Dijkstra from preferring UCIe over the mesh because
+   UCIe looks like a zero-distance bus).
+
+Each graph is a `defaultdict(list)` of `(neighbor, weight)` pairs, with
+`edge.routing_weight_mm or edge.distance_mm` as the weight.
+
+In short, **the PathRouter's first job is to classify the topology
+edges under four different policies and build four adjacency lists**.
+Each `find_*()` call picks the appropriate graph and runs Dijkstra on
+it.
+
+## Context
+
+`policy/routing/router.py` carries two responsibilities together:
+
+- **Name mapping**: sole owner of the topology naming rules
+  (`sip{S}.cube{C}.<comp>`, `sip{S}.io{I}.pcie_ep`, etc.). Components,
+  the probe, IPCQ install, and the runtime API call the helpers instead
+  of building name strings themselves.
+- **Path decisions**: policy separation by edge `kind`. Even for the
+  same src→dst, a different adjacency must be used depending on the
+  routing intent (PE DMA vs M_CPU DMA vs general component routing), and
+  the result differs accordingly.
+
+Although this helper API is consumed widely across the code (probe.py /
+distributed.py / install.py / various components / tests), **its exact
+signatures, return semantics, and choice of adjacency** are not
+collected in one place at the ADR level. This ADR fills that gap.
+
+## Decision
+
+### D1. The five public APIs of `AddressResolver`
+
+#### D1.1. `resolve(addr: PhysAddr) -> str`
+
+Converts a `PhysAddr` instance into a destination node id in the
+topology.
+
+```
+addr.kind == "hbm"             → f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
+  where pe_id = addr.hbm_offset // self._hbm_slice_bytes  (ADR-0017 D4/D9)
+
+addr.kind == "pe_resource":
+  addr.unit_type == PE         → f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
+  addr.unit_type == SRAM       → f"sip{s}.cube{d}.sram"
+  addr.unit_type == MCPU       → f"sip{s}.cube{d}.m_cpu"
+  otherwise                    → RoutingError("unsupported unit_type")
+
+any other kind                 → RoutingError("unsupported address kind")
+```
+
+If the resulting node id is not in `self._node_ids`, it raises
+`RoutingError(f"node {node_id} not found in topology")`. That is, even
+when the address is syntactically valid, it fails loudly if no node in
+the topology actually maps to it.
+
+#### D1.2. `find_m_cpu(sip, cube) -> str`
+
+`f"sip{sip}.cube{cube}.m_cpu"`. Raises `RoutingError` if absent.
+
+#### D1.3. `find_pcie_ep(sip, io_id="io0") -> str`
+
+`f"sip{sip}.{io_id}.pcie_ep"`. Raises `RoutingError` if absent.
+
+#### D1.4. `find_io_cpu(sip, io_id="io0") -> str`
+
+`f"sip{sip}.{io_id}.io_cpu"`. Raises `RoutingError` if absent.
+
+#### D1.5. `find_all_pcie_eps() -> list[str]`
+
+Returns the PCIE_EP node ids of all SIPs as a sorted list, filtered with
+`endswith(".pcie_ep")`. Used when cross-SIP IPCQ enumerates every
+PCIE_EP.
+
+This class is the sole owner of the naming rules
+(`sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`) (ADR-0015 D4). The
+topology builder creates nodes with the same naming rules, and
+components never construct name strings directly; they always go
+through the helpers.
+
+### D2. The four adjacency graphs of `PathRouter`
+
+The constructor builds them in one pass. The edge `kind` decides the
+policy:
+
+| graph           | excluded edge kinds | used for |
+|-----------------|---------------------|----------|
+| `_adj_all`      | (none) | includes M_CPU↔NOC command edges; IO_CPU/M_CPU routing |
+| `_adj`          | `"command"` | PE DMA / general data path |
+| `_adj_mcpu_dma` | `"pe_internal"`, `"pe_to_router"` | M_CPU DMA (bypasses the PE pipeline) |
+| `_adj_local`    | `_UCIE_KINDS` (`ucie_internal`, `ucie_conn_to_router`, `router_to_ucie_conn`, `ucie_conn_to_noc`, `noc_to_ucie_conn`, `ucie_mesh`, `io_to_cube`, `cube_to_io`) | same-cube routing (bypasses the UCIe bus) |
+
+Each graph is a `dict[node_id, list[(neighbor, weight)]]`, and the
+weight is `edge.routing_weight_mm or edge.distance_mm`. The split makes
+the routing influence of command edges explicit, and the separate
+`_adj_local` stops UCIe's "0-distance bus" from being preferred over the
+mesh, consistent with the cross-PE-slice mesh-distance requirement of
+ADR-0017 D7.
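
The four-way classification in the table can be sketched as a single pass over the edges. This is an illustration only: the `Edge` dataclass is a hypothetical stand-in for the project's edge type, the graph keys are shortened names, and the sketch assumes edges are directed (one adjacency entry per edge).

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass

UCIE_KINDS = {
    "ucie_internal", "ucie_conn_to_router", "router_to_ucie_conn",
    "ucie_conn_to_noc", "noc_to_ucie_conn", "ucie_mesh",
    "io_to_cube", "cube_to_io",
}
MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}


@dataclass(frozen=True)
class Edge:  # hypothetical stand-in for the topology edge type
    src: str
    dst: str
    kind: str
    distance_mm: float
    routing_weight_mm: float | None = None


def build_adjacency(edges):
    """Classify edges into the four policy graphs in one pass."""
    graphs = {name: defaultdict(list)
              for name in ("all", "data", "mcpu_dma", "local")}
    for e in edges:
        weight = e.routing_weight_mm or e.distance_mm
        entry = (e.dst, weight)
        graphs["all"][e.src].append(entry)
        if e.kind != "command":
            graphs["data"][e.src].append(entry)
        if e.kind not in MCPU_DMA_EXCLUDE:
            graphs["mcpu_dma"][e.src].append(entry)
        if e.kind not in UCIE_KINDS:
            graphs["local"][e.src].append(entry)
    return graphs
```

Because each edge is tested against every policy independently, adding a new edge kind only requires deciding which exclusion sets it joins.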
+
+### D3. The five public APIs of `PathRouter` (+ 2 backward-compat)
+
+#### D3.1. `find_path(src_pe: str, dst_node: str) -> list[str]`
+
+**PE DMA routing**. `src_pe` is a PE prefix (e.g. `"sip0.cube0.pe0"`);
+the function automatically appends `.pe_dma`, making the actual start
+node `"sip0.cube0.pe0.pe_dma"`.
+
+The adjacency is chosen by whether the route is cube-local
+(`_same_cube`):
+
+- **same-cube** (src and dst share the `sip{S}.cube{C}.` prefix): uses
+  `_adj_local`. Blocking the UCIe detour makes cross-PE-slice traffic
+  pay the exact mesh distance (ADR-0017 D7).
+- **cross-cube**: uses `_adj`. UCIe is naturally included as the best
+  option for a cross-cube path.
+
+#### D3.2. `find_path_with_distance(src_pe, dst_node) -> tuple[list[str], float]`
+
+Uses the same adjacency policy as D3.1 but returns
+`(path, total_distance)` together. Used by the probe and analysis tools
+when a distance metric is needed.
+
+#### D3.3. `find_mcpu_dma_path(m_cpu_id: str, dst_hbm_id: str) -> list[str]`
+
+**M_CPU DMA path**. Uses `_adj_local` when the cube is the same (the
+route finishes inside the mesh) and `_adj_all` when it differs (via
+UCIe). Because `_MCPU_DMA_EXCLUDE` automatically excludes PE pipeline
+nodes, no wrong path is produced in which M_CPU routes through a PE's
+internal stages.
+
+#### D3.4. `find_memory_path(src: str, dst: str) -> list[str]`
+
+A direct memory path such as
+`pcie_ep → io_noc → cube → router mesh → hbm_ctrl`. Uses `_adj_mcpu_dma`
+to exclude `pe_internal` and `pe_to_router` edges, guaranteeing that
+host-issued reads/writes do not leak into the PE pipeline. Called
+directly by the probe (the H2D/D2H cases of ADR-0049 D1).
+
+#### D3.5. `find_node_path(src: str, dst: str) -> list[str]`
+
+A path between any two nodes, **including command edges** (uses
+`_adj_all`). Called by IoCpuComponent, MCpuComponent, and others that
+must cross command-kind links such as M_CPU ↔ NOC.
+
+#### D3.6. backward-compat shims
+
+- `_dijkstra(start, goal) -> list[str]`: a thin wrapper around
+  `_run_dijkstra(self._adj, …)`.
+- `_dijkstra_with_dist(start, goal) -> tuple[list[str], float]`: the
+  variant that also returns the distance.
+
+The underscore prefix suggests internal APIs, but existing tests call
+them directly. New code should use D3.1–D3.5; these two shims are
+deprecation candidates.
+
+### D4. Dijkstra algorithm: single-source shortest path
+
+`_run_dijkstra_with_dist(adj, start, goal)`:
+
+- `heapq` priority queue.
+- `best: dict[node, distance]`: cache of the shortest distance per node.
+- `prev: dict[node, predecessor]`: path reconstruction.
+- The weight is `routing_weight_mm or distance_mm`. Some edges, such as
+  UCIe, specify a routing_weight that differs from their distance, so
+  separating the weight is intentional.
+
+`start == goal` takes the fast path and returns `([start], 0.0)`. An
+unreachable goal raises `RoutingError(f"no path from {start} to {goal}")`.
+
+The algorithm is **deterministic**: the same graph + start/goal yields
+the same path, consistent with the SPEC R1 requirement "Routing MUST be
+deterministic". Tie-breaking follows the `heapq` push order (Python list
+order is deterministic).
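
The description above corresponds to a textbook `heapq`-based Dijkstra. The sketch below is not the project's `_run_dijkstra_with_dist`: `shortest_path` is a hypothetical name, `RoutingError` is redefined locally as a stand-in, and the explicit push counter is just one way to make the push-order tie-break visible.

```python
import heapq


class RoutingError(Exception):
    """Local stand-in for the project's RoutingError."""


def shortest_path(adj, start, goal):
    """Return (path, distance) over adj: node -> list of (neighbor, weight)."""
    if start == goal:
        return [start], 0.0              # fast path
    best = {start: 0.0}                  # shortest known distance per node
    prev = {}                            # predecessor for path reconstruction
    counter = 0                          # push order breaks distance ties
    heap = [(0.0, counter, start)]
    while heap:
        dist, _, node = heapq.heappop(heap)
        if node == goal:
            path = [goal]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return path[::-1], dist
        if dist > best.get(node, float("inf")):
            continue                     # stale heap entry
        for nxt, weight in adj.get(node, ()):
            cand = dist + weight
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = node
                counter += 1
                heapq.heappush(heap, (cand, counter, nxt))
    raise RoutingError(f"no path from {start} to {goal}")
```

Passing a different adjacency dict for the same `start` and `goal` is all it takes to switch routing policy, which is the design the four-graph split relies on.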
+
+### D5. Single-owner principle for the helper API
+
+The following are decided only inside router.py:
+
+- Naming rules: `sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`,
+  `sip{S}.cube{C}.hbm_ctrl.pe{pe_id}`.
+- Adjacency policy: which edge kinds go into which graph.
+- How to recover the PE id from the HBM slice size.
+- Dijkstra's weight choice (`routing_weight_mm or distance_mm`).
+
+If this single-owner principle breaks (e.g. components start building
+`f"sip{s}..."` themselves), the blast radius of a naming-rule change
+explodes. This aligns with the intent of ADR-0015 D4.
+
+### D6. List of helper API consumers
+
+Call sites of the methods exposed by these helpers (as of the current
+corpus):
+
+- `probes/probe.py` (ADR-0049): `find_pcie_ep`, `find_io_cpu`,
+  `find_m_cpu`, `find_node_path`, `find_mcpu_dma_path`,
+  `find_memory_path`, `find_path`, `resolve`.
+- `runtime_api/distributed.py` (ADR-0047): indirect (routing inside the
+  engine).
+- `ccl/install.py` (ADR-0023): `find_all_pcie_eps`, `resolve`.
+- `sim_engine/event_log.py`: like the probe, `find_pcie_ep`,
+  `find_memory_path`.
+- `components/builtin/m_cpu.py`, `components/builtin/io_cpu.py`:
+  `find_node_path`, `find_mcpu_dma_path`.
+- Various tests (test_routing.py, test_cross_sip_routing.py, etc.): most
+  of D3.1–D3.5.
+
+When a new consumer is added, D1/D3 of this ADR serve as the first
+reference for judging whether a method matching its intent already
+exists or a new one must be added.
+
+## Alternatives Considered
+
+### A1. Single adjacency graph + dynamically applied edge-kind filter
+
+Rejected. Re-filtering the graph on every `find_*()` call hurts
+Dijkstra's cache locality and performance. Building four graphs up front
+(D2) costs little memory (edges number in the tens of thousands at
+most), and the policy choice at call time is O(1).
+
+### A2. Separate adjacency by dedicated metadata instead of edge `kind`
+
+Rejected. The topology builder already assigns the edge `kind`
+(ADR-0015 D4 and ADR-0017); introducing separate metadata would create
+duplication in which two systems must be kept in sync.
+
+### A3. BFS + uniform weight instead of Dijkstra
+
+Rejected. When routing_weight_mm differs per edge (mesh link / UCIe /
+IO-internal), BFS only minimizes hop count and does not guarantee the
+shortest latency / distance. That would violate the deterministic and
+accurate routing requirements of SPEC R1 + R2.
+
+### A4. Helper API as module functions instead of class methods
+
+Rejected. The two classes (`AddressResolver`, `PathRouter`) each need to
+hold cached state (`_node_ids`, `_hbm_slice_bytes`, 4 adjacency graphs),
+and many routing queries hit the same graph instance. Module functions
+would have to rebuild the state on every call or keep it global, hurting
+safety and performance.
+
+## Consequences
+
+- If components, the probe, IPCQ install, and the runtime API all call
+  only the helpers in router.py, a naming-rule change (e.g. `.io0.` →
+  `.iochiplet0.`) is done by editing a single file (D5).
+- The four-graph split of D2 is fixed in the ADR, giving a clear
+  criterion for deciding which graphs a new edge kind (e.g. a new kind
+  for an inter-die UCIe link) belongs to.
+- The cube-local vs cross-cube branch of D3.1 (ADR-0017 D7) is explicit,
+  so anyone changing routing behavior later knows which adjacency to
+  touch.
+- The consumer list of D6 is explicit, so the PR review scope of a
+  helper API change is clear. The deprecation candidates among the
+  backward-compat shims (D3.6) are identified.
@@ -0,0 +1,352 @@
+# ADR-0052: OpLog + MemoryStore Schemas — sim_engine internals
+
+## Status
+
+Accepted (2026-05-22).
+
+Specifies the `OpRecord` schema in `sim_engine/op_log.py`, the behavior
+of `OpLogger`'s record_start / record_end / record_copy, and the
+(space, addr) address-space namespace and read/write semantics used by
+`MemoryStore` in `sim_engine/memory_store.py`. ADR-0020 (2-pass data
+execution) declares that both pieces of infrastructure exist, but **the
+exact fields of the records and their meaning** were never documented
+at the ADR level, even though several ADRs, including ADR-0046 D3.2
+(`tl.store` visibility) and ADR-0023 D9 (IPCQ copy record), depend on
+their behavior.
+
+## First action
+
+### `OpLogger(memory_store=None)`
+
+On construction it initializes three fields:
+
+1. `self._records: list[OpRecord] = []`: the accumulated op records.
+2. `self._pending: dict[int, dict] = {}`: partial records keyed by
+   `id(msg)` (created at record_start, with record_end not yet
+   received).
+3. `self._memory_store = memory_store`: an optional MemoryStore
+   reference, used to capture input snapshots for math ops and HBM
+   source snapshots for dma_write.
+
+At construction both records and pending are empty; subsequent
+`record_*` calls accumulate data in order.
+
+### `MemoryStore()`
+
+On construction it initializes a single field,
+`self._storage: dict[str, dict[int, np.ndarray]] = {}`. It is a
+two-level dict (`space → addr → ndarray`) whose inner dicts are filled
+lazily as each space is first needed.
+
+In short, **the first job of both is to create an empty accumulation
+buffer and a per-space sparse dict**. The fields start filling only when
+the first record / write actually arrives.
+
+## Context
+
+D2/D5/D7 of ADR-0020 (2-pass data execution) declare the following:
+
+- During Phase 1 (timing), the `ComponentBase._on_process_start/end`
+  hooks call `OpLogger.record_start/end` to record the time and metadata
+  of every data op.
+- Phase 2 (data) replays the op_log in t_start order to compute the
+  actual data results.
+- The data payload itself is kept in `MemoryStore` under a (space, addr)
+  key.
+
+Later ADRs such as ADR-0023 D9 (IPCQ atomic write), ADR-0027 (avoiding
+Megatron TP scratch overwrite), and ADR-0046 D3.2 (`tl.store`
+visibility) depend on the behavior of op_log and MemoryStore, yet **the
+exact record fields / space names / snapshot timing** can only be found
+by grepping the code. This ADR documents them.
+
+## Decision
+
+### D1. `OpRecord` schema: 7 fields
+
+```python
+from dataclasses import dataclass, field
+from typing import Any
+
+@dataclass
+class OpRecord:
+    t_start: float
+    t_end: float
+    component_id: str
+    op_kind: str               # "memory" | "gemm" | "math" | "unknown"
+    op_name: str               # e.g. "dma_read", "gemm_f16", "exp",
+                               #     "TileToken/DMA_READ", "composite_gemm",
+                               #     "ipcq_copy"
+    params: dict[str, Any]
+    dependency_ids: list[int] = field(default_factory=list)
+```
+
+- **`t_start` / `t_end`**: SimPy time (float ns). `t_start` is when the
+  component started the op and `t_end` is when it completed; duration =
+  `t_end - t_start`.
+- **`component_id`**: id of the node where the op occurred (e.g.
+  `"sip0.cube0.pe0.pe_dma"`).
+- **`op_kind`**: one of four values. The Phase 2 DataExecutor branches
+  on it.
+- **`op_name`**: human-friendly name for debugging / analysis. For a
+  TileToken it is `"TileToken/{stage_type}"` (e.g.
+  `"TileToken/DMA_READ"`), which distinguishes the stage.
+- **`params`**: op-specific metadata dict (see D3).
+- **`dependency_ids`**: currently unused (default `[]`). Reserved for
+  when cross-op dependency tracking becomes necessary.
+
+### D2. `OpLogger.records`: guaranteed t_start ordering
+
+```python
+@property
+def records(self) -> list[OpRecord]:
+    self._records.sort(key=lambda r: r.t_start)
+    return self._records
+```
+
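+Every access performs a stable sort by `t_start` (Python's `list.sort` is
+stable), so records with the same t_start keep their insertion order. This
+matches the "t_start stable ordering" requirement of ADR-0020 D5.
+
+The Phase 2 DataExecutor always goes through the `records` property, so even
+when record_end calls arrive in an order different from t_start (e.g. a
+short op starts later than a long op but finishes first), the records are
+re-sorted and it receives a consistent sequence.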
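
A minimal sketch of the ordering guarantee above, assuming a trimmed-down `OpRecord` and made-up op names (`long`, `short_a`, `short_b`). Records are appended in completion order, as they would be if `record_end` did the append, and a stable sort on `t_start` restores start order while keeping ties in insertion order.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    # Trimmed-down stand-in for the 7-field schema in D1.
    t_start: float
    t_end: float
    component_id: str
    op_kind: str
    op_name: str
    params: dict[str, Any] = field(default_factory=dict)


# Appended in completion order: both short ops finish before the long one,
# and the two short ops share t_start = 5.0.
records = [
    OpRecord(5.0, 5.5, "pe0", "math", "short_a"),
    OpRecord(5.0, 6.0, "pe1", "math", "short_b"),
    OpRecord(0.0, 9.0, "pe0", "gemm", "long"),
]

# list.sort is stable: equal keys keep their relative (insertion) order.
records.sort(key=lambda r: r.t_start)
ordered_names = [r.op_name for r in records]
```

With these inputs `long` moves to the front and `short_a` stays ahead of `short_b`, because the two tie on `t_start`.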
+
+### D3. `params` schema per op_name (`_extract_op_info` mapping)
+
+#### D3.1. `op_kind="memory", op_name="dma_read"` (DmaReadCmd)
+
+```python
+{"src_addr": int, "nbytes": int, "handle_id": str}
+```
+
+#### D3.2. `op_kind="memory", op_name="dma_write"` (DmaWriteCmd)
+
+```python
+{
+    "src_space": str,   # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
+    "src_addr": int,    # handle.addr
+    "shape": tuple, "dtype": str,
+    "dst_space": "hbm", # DmaWrite always targets HBM
+    "dst_addr": int,
+    "nbytes": int,
+    "handle_id": str,
+    # at record_end, a snapshot is added if src_space == "hbm" (D4)
+    "snapshot": np.ndarray | None,
+}
+```
+
+#### D3.3. `op_kind="gemm", op_name=f"gemm_{dtype_a}"` (GemmCmd)
+
+```python
+{
+    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
+    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
+    "dtype_in": str, "dtype_out": str,
+    "m": int, "k": int, "n": int,
+    # ADR-0027: per-operand and output spaces are preserved
+    "src_a_space": str, "src_b_space": str, "dst_space": str,
+}
+```
+
+#### D3.4. `op_kind="math", op_name=msg.op` (MathCmd; op = "exp", "sum", "add", "where", etc.)
+
+```python
+{
+    "input_addrs": list[int],   # addr of each input handle
+    "input_shapes": list[tuple],
+    "input_spaces": list[str],
+    "input_dtypes": list[str],
+    "dst_addr": int, "dst_space": str,
+    "shape_out": tuple, "dtype": str,
+    "axis": int | None,         # meaningful only for reductions
+    # at record_end, snapshots of all inputs are filled in (D4)
+    "input_snapshots": list[np.ndarray | None],
+}
+```
+
+#### D3.5. `op_kind="gemm" or "math", op_name=f"composite_{op}"` (CompositeCmd)
+
+```python
+{
+    "op": str,              # "gemm" | "math"
+    "out_addr": int, "out_nbytes": int,
+    # when op == "gemm", the same fields as GemmCmd are added:
+    "src_a_addr": int, "src_b_addr": int,
+    "shape_a": tuple, "shape_b": tuple,
+    "dtype_in": str, "dtype_out": str,
+    "src_a_space": str, "src_b_space": str,
+    "dst_space": "hbm", "dst_addr": int,  # = out_addr
+}
+```
+
+If `op == "gemm"` then `op_kind = "gemm"`, otherwise `"math"`. This aliases
+the record so that Phase 2 replays it along the same path as GemmCmd.
+
+#### D3.6. `op_kind="memory", op_name="ipcq_copy"` (dedicated record_copy path)
+
+```python
+{
+    "src_space": str, "src_addr": int,
+    "dst_space": str, "dst_addr": int,
+    "shape": tuple, "dtype": str, "nbytes": int,
+    "snapshot": np.ndarray | None,   # passed by the caller; if absent, record_copy does a fresh read
+}
+```
+
+`PE_DMA._handle_ipcq_inbound` (ADR-0023 D9) emits this record so that
+Phase 2 can replay the inbound copy into the IPCQ slot. The record does not
+go through `record_start` / `record_end`; it is pushed directly via
+`record_copy()`.
+
+#### D3.7. `op_kind="unknown", op_name=type(msg).__name__`
+
+Fallback for messages that `_extract_op_info` does not recognize. params =
+`{}`. When the DataExecutor encounters this op_kind it skips the record, so
+Phase 2 replay is unaffected.
+
+### D4. Snapshot capture timing
+
+When `OpLogger._memory_store` is set, record_end does the following:
+
+- **math op**: calls `self._memory_store.read(...)` with each input's
+  addr/shape/space/dtype and attaches ndarray copies to
+  `params["input_snapshots"]`. An entry is None if the read fails.
+- **dma_write op**: attaches a snapshot of the source HBM to
+  `params["snapshot"]` only when `src_space == "hbm"`. TCM sources are
+  **explicitly skipped**: TCM (PE scratch) is refilled by Phase 2 math/gemm
+  replay, so a Phase-1-time snapshot risks capturing stale data from a
+  previous kernel (ADR-0027 postmortem: TP gemm → all_reduce race).
+- **ipcq_copy**: the `record_copy` caller passes an in-flight snapshot such
+  as `snapshot=token.data`. If none is given, record_copy attempts a fresh
+  read instead.
+
+Snapshots are taken with `.copy()` (`ndarray.copy()` allocates fresh
+storage), so they are safe from later storage mutation. This is the basis
+for avoiding the "cross-PE Phase 2 ordering" race of ADR-0027.
+
+When `memory_store` is None (Phase 1 timing-only mode), all snapshot steps
+are skipped. Only the timing information of each record is kept and data
+replay is not possible.
+
+### D5. TileToken handling: record_start captures stage information
+
+For the self-routing tile token of ADR-0014 D6 (pipeline mode), stage_idx
+may already have advanced by the time record_end runs (the TileToken caches
+the next stage's params as it moves to the next component). Therefore:
+
+`record_start` stores the following in `pending[id(msg)]["snap"]` up front:
+
+```python
+snap["stage_type"] = stage.stage_type.name        # "DMA_READ", "GEMM", etc.
+snap["stage_params"] = dict(stage.params)         # copy of params at that moment
+```
+
+`record_end` retrieves this snap and merges it into params:
+
+- `params["stage_type"]` is added to the final params.
+- The keys of `stage_params` are merged (existing keys are preserved).
+- If `op_name == "TileToken"`, it is rewritten to
+  `op_name = f"TileToken/{stage_type}"` (e.g. `"TileToken/DMA_READ"`) to
+  disambiguate records of different stages that occur on the same component.
+
+Thanks to this mechanism, DMA_READ vs DMA_WRITE and FETCH vs STORE can be
+told apart on the reporting side even when they occur on the same component
+(e.g. pe_dma).
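
The capture-then-merge flow of D5 can be sketched as follows. `StageType`, `Stage`, `TileToken`, and `MiniLogger` are hypothetical stand-ins, not the real classes; the point is only that the stage is read at `record_start`, so advancing `stage_idx` before `record_end` does not change what gets logged.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class StageType(Enum):
    DMA_READ = 1
    GEMM = 2


@dataclass
class Stage:
    stage_type: StageType
    params: dict[str, Any]


@dataclass
class TileToken:
    stages: list[Stage]
    stage_idx: int = 0


class MiniLogger:
    """Hypothetical, heavily simplified OpLogger."""

    def __init__(self) -> None:
        self.pending: dict[int, dict[str, Any]] = {}
        self.records: list[dict[str, Any]] = []

    def record_start(self, msg: TileToken, t: float) -> None:
        stage = msg.stages[msg.stage_idx]
        snap = {
            "stage_type": stage.stage_type.name,
            "stage_params": dict(stage.params),  # copy taken at start time
        }
        self.pending[id(msg)] = {"t_start": t, "snap": snap}

    def record_end(self, msg: TileToken, t: float) -> None:
        entry = self.pending.pop(id(msg))
        snap = entry["snap"]
        params: dict[str, Any] = {"stage_type": snap["stage_type"]}
        for key, value in snap["stage_params"].items():
            params.setdefault(key, value)  # keys already present win
        self.records.append({
            "t_start": entry["t_start"],
            "t_end": t,
            "op_name": f"TileToken/{snap['stage_type']}",
            "params": params,
        })


token = TileToken([Stage(StageType.DMA_READ, {"nbytes": 64}),
                   Stage(StageType.GEMM, {"m": 8})])
logger = MiniLogger()
logger.record_start(token, 0.0)
token.stage_idx += 1           # token advances before record_end fires
logger.record_end(token, 3.0)
```

The logged record is still attributed to `DMA_READ`, even though the token already points at its `GEMM` stage when `record_end` runs.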
+
+### D6. `MemoryStore`: two-level dict keyed by (space, addr)
+
+```python
+class MemoryStore:
+    def __init__(self) -> None:
+        self._storage: dict[str, dict[int, np.ndarray]] = {}
+
+    def write(self, space, addr, data): self._storage.setdefault(space, {})[addr] = data
+    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
+    def has(self, space, addr) -> bool: ...
+    def snapshot(self) -> MemoryStore: ...
+```
+
+#### D6.1. space namespace
+
+String keys. Standard values:
+
+- `"hbm"`: HBM data (deploy_tensor + Phase 2 dma_write results).
+- `"tcm"`: PE-local TCM (Phase 2 math/gemm results).
+- `"sram"`: cube-level SRAM (ADR-0023 D9.7 IPCQ slot tier).
+
+Other spaces (e.g. `"reg"`) are freely allowed: `_storage` is a lazy dict,
+so a new space is created automatically on the first write call.
+
+#### D6.2. address keying
+
+`addr` is an integer. It may be a **physical address (PA) or a virtual
+address (VA)**: MemoryStore itself does not know the meaning of the address
+space and simply uses the value as a key. In Phase 1, `MemoryWriteMsg`
+writes under both PA and VA (`_create_tensor` zero-inits at the PA and also
+at the VA base), and Phase 2 reads/writes at the addresses captured in the
+op_log.
+
+The meaning of `addr` is decided by the caller; `MemoryStore` only provides
+lookup.
+
+#### D6.3. read/write semantics: reference store (no copy)
+
+`write(space, addr, data)`: stores a reference to the `data` ndarray. It
+**does not copy**. If the caller later mutates the same ndarray, the stored
+value changes as well.
+
+`read(space, addr, shape=None, dtype=None)`: returns a reference to the
+stored ndarray. If `shape` or `dtype` is provided:
+
+- `dtype != stored.dtype`: reinterpret cast via `arr.view(np_dtype)` (no
+  copy).
+- `shape != stored.shape`: `arr.reshape(shape)` (view) if `nbytes` matches.
+- `nbytes` mismatch: `ValueError`.
+
+To isolate data safely, the caller must call `arr.copy()`. This is why the
+race avoidance of ADR-0027 enforces an explicit copy at the op_log snapshot
+step.
+
+#### D6.4. `has(space, addr) -> bool`
+
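+Only checks whether the key exists. It does not instantiate data.
+
+#### D6.5. `snapshot() -> MemoryStore`
+
+Shallow copy. It creates new instances of the inner dicts but shares the
+ndarray references. It is used at Phase 2 initialization to fork the Phase 1
+store so that Phase 2 writes (which rebind keys) do not affect other users
+of the Phase 1 store. In-place mutation of a shared ndarray remains visible
+on both sides.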
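
The D6 semantics can be exercised with a small stand-in. This is a sketch under assumptions, not the real `MemoryStore`: `MiniStore` is a hypothetical name, the size check compares total bytes, and the error message is invented. It assumes NumPy is installed.

```python
import numpy as np


class MiniStore:
    """Hypothetical stand-in for MemoryStore (D6)."""

    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}

    def write(self, space: str, addr: int, data: np.ndarray) -> None:
        # Lazy space creation (D6.1); stores a reference, no copy (D6.3).
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray:
        arr = self._storage[space][addr]
        if dtype is not None and np.dtype(dtype) != arr.dtype:
            arr = arr.view(np.dtype(dtype))        # reinterpret cast, no copy
        if shape is not None and tuple(shape) != arr.shape:
            want = int(np.prod(shape)) * arr.dtype.itemsize
            if want != arr.nbytes:
                raise ValueError(f"nbytes mismatch: {want} != {arr.nbytes}")
            arr = arr.reshape(shape)               # view for contiguous data
        return arr

    def has(self, space: str, addr: int) -> bool:
        return addr in self._storage.get(space, {})

    def snapshot(self) -> "MiniStore":
        fork = MiniStore()
        # New inner dicts, shared ndarray references (D6.5).
        fork._storage = {s: dict(inner) for s, inner in self._storage.items()}
        return fork


store = MiniStore()
buf = np.zeros(4, dtype=np.float32)
store.write("tcm", 0x100, buf)

buf[0] = 1.0                                   # caller mutates after write
aliased = float(store.read("tcm", 0x100)[0])   # the store sees the change

safe = store.read("tcm", 0x100).copy()         # explicit copy isolates data
buf[1] = 2.0

as_2x2 = store.read("tcm", 0x100, shape=(2, 2))

fork = store.snapshot()
fork.write("tcm", 0x200, np.ones(2, dtype=np.float32))
```

A write to the fork adds a key only on the fork, while the array at `0x100` is the same object in both stores.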
+
+### D7. op_log assumes single-threaded SimPy
+
+`OpLogger`'s `_records` and `_pending` are used without locks. The
+assumption is that, because SimPy is single-threaded, no other thread can
+interleave between `record_start` and `record_end`.
+
+If multi-process kernbench (ADR-0047 D6) is introduced in the future, the
+OpLogger must be split per process. A single OpLogger instance cannot
+receive records from multiple processes.
+
+## Alternatives Considered
+
+### A1. Move op_log to an external store such as SQLite / parquet
+
+Rejected (for now). An in-memory list minimizes the Phase 1 → Phase 2
+handoff latency. Externalizing could make sense for long-running batch runs,
+but for the current single-run workloads it only adds overhead.
+
+### A2. Capture snapshots at record_start
+
+Rejected. At record_start the inputs are often not yet populated (e.g. when
+a math op's input is the output of the immediately preceding op).
+record_end is the correct point.
+
+### A3. Split MemoryStore into per-component stores
+
+Rejected. The (space, addr) key already provides sufficient disambiguation,
+and a per-component split would complicate cross-PE IPCQ copy (ADR-0023
+D9), which must access both the source and destination stores.
+
+### A4. Make cross-op dependency edges explicit in op_log
+
+Partially adopted. The `dependency_ids` field exists on OpRecord but is
+currently unused (D1). The Phase 2 DataExecutor determines ordering by
+t_start sort plus a secondary sort (memory ops before math at the same
+t_start); the field is where an explicit dependency graph would go if one
+becomes necessary. The ordering rule is sufficient today, so it stays unused.
+
+## Consequences
+
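+- The op_log / MemoryStore declarations of ADR-0020 are expanded into the
+  concrete schemas of D1–D6, so the exact field semantics can be looked up
+  in the ADR, without grep, when writing or modifying the Phase 2
+  DataExecutor.
+- The per-op_name params schema of D3 is explicit, so it is clear which
+  `_extract_op_info` branch to extend when adding a new op (e.g. a new
+  reduction type).
+- The snapshot-timing difference of D4 (math = input snapshot, dma_write =
+  HBM-only snapshot) is fixed in an ADR, so the cross-PE race avoidance
+  decision of ADR-0027 will not silently break in a future refactor.
+- The reference-store semantics of D6.3 are explicit, so callers know they
+  own mutation safety. This justifies the explicit `.copy()` pattern of
+  ADR-0027.
+- The single-thread assumption of D7 is explicit, so it is clear that the
+  OpLogger must be split when multi-process kernbench (ADR-0047 D6, a
+  supersession candidate) is introduced.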
@@ -0,0 +1,307 @@
+# ADR-0053: Topology Builder + Visualizer Algorithms
+
+## Status
+
+Accepted (2026-05-22).
+
+Documents the key algorithm choices (placement-driven router attachment,
+mesh auto-layout, source_hash cache, view projection, SVG rendering) of the
+topology compile and visualization pipeline implemented jointly by
+`topology/builder.py`, `topology/mesh_gen.py`, and `topology/visualizer.py`.
+ADR-0006 defines the high-level intent of topology compilation (compiled
+topology, distance extraction, automatic diagram generation), but **which
+algorithms the builder actually uses** could only be confirmed by grepping
+the code.
+
+## First action
+
+When `resolve_topology(path_str)` is called, the following 4 steps happen in
+order:
+
+1. **Path validation** (`builder.py::resolve_topology`):
+   `Path(path_str).expanduser().resolve()`, existence check, is-file check.
+   On failure, `FileNotFoundError` or `ValueError`.
+2. **YAML parsing** (`_read_spec`): `yaml.safe_load`. A parse error raises
+   `ValueError` with line/column information. Anything other than a dict is
+   rejected.
+3. **Automatic mesh generation** (`mesh_gen.ensure_mesh_file`): creates
+   `cube_mesh.yaml` in the same directory as the topology yaml (when the
+   cache is invalid) or reuses it (on a cache hit). This step determines the
+   router grid and attachment information of the cube NoC.
+4. **Graph compilation** (`_compile_graph`): accumulates nodes/edges in the
+   order system → IO chiplets → cubes → inter-cube edges → IO↔cube edges →
+   system↔IO edges, then builds the 4 view projections (system, sip, cube,
+   pe) and bundles everything into a `TopologyGraph`.
+
+In short, **the first job of topology compile is to read topology.yaml into
+a dict, create/validate cube_mesh.yaml in the same directory, and then build
+the flat graph + 4-view projection in system→sip→cube→pe order**.
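
Step 1 (path validation) can be sketched with the standard library alone. `validate_topology_path` and `classify` are hypothetical names, and the mapping of "missing" to `FileNotFoundError` and "not a file" to `ValueError` is an assumption; the ADR only says that one of the two is raised.

```python
import tempfile
from pathlib import Path


def validate_topology_path(path_str: str) -> Path:
    """Hypothetical stand-in for the path checks in resolve_topology()."""
    path = Path(path_str).expanduser().resolve()
    if not path.exists():
        raise FileNotFoundError(f"topology file not found: {path}")
    if not path.is_file():
        # Assumption: a directory (or other non-file) maps to ValueError.
        raise ValueError(f"topology path is not a file: {path}")
    return path


def classify(path_str: str) -> str:
    """Return 'ok' or the name of the exception the validation raises."""
    try:
        validate_topology_path(path_str)
    except (FileNotFoundError, ValueError) as exc:
        return type(exc).__name__
    return "ok"


with tempfile.TemporaryDirectory() as tmp:
    topo = Path(tmp) / "topology.yaml"
    topo.write_text("system: {}\n")
    outcomes = {
        "file": classify(str(topo)),
        "directory": classify(tmp),
        "missing": classify(str(Path(tmp) / "nope.yaml")),
    }
```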
+
+## Context
+
+Responsibilities of the `topology/` package:
+
+- **builder.py** (1207 lines): compiles topology.yaml into a `TopologyGraph`
+  (nodes + edges + 4 view projections).
+- **mesh_gen.py** (305 lines): automatically decides the router grid of the
+  cube NoC and the PE/UCIe/M_CPU/SRAM attachment positions, and caches them
+  as `cube_mesh.yaml`.
+- **visualizer.py** (887 lines): generates 4 SVG diagrams (system / sip /
+  cube / pe) from a `TopologyGraph`.
+
+ADR-0006 defines the high-level decision that "the result of topology
+compilation is the single source for distance metadata and diagram
+generation", but the concrete algorithms (e.g. placement-driven
+nearest-router attachment, how the HBM exclusion zone is computed, which
+source_hash fields trigger invalidation) are not in any ADR.
+
+In particular, the following decisions are missing at the ADR level:
+
+- Why is the mesh_gen output cached in a separate file (`cube_mesh.yaml`)?
+- Which fields does source_hash include, and which changes force
+  regeneration?
+- Why are placement coordinates in mm rather than cube coordinates?
+- How are the HBM exclusion zone and the UCIe N/S/E/W distribution decided
+  within the mesh?
+- How do the abstraction levels of the 4 view projections
+  (system/sip/cube/pe) differ?
+
+This ADR collects these decisions in one place.
+
+## Decision
+
+### D1. Compile pipeline: 6 steps
+
+`_compile_graph(spec)`:
+
+1. **Create system nodes** (`_instantiate_system`): adds system-level nodes
+   such as `fabric.switch0` and the host CPU.
+2. **Per-SIP loop** (`for sip_id in range(system.sips.count)`):
+   - **IO chiplets** (`_instantiate_io_chiplets`): creates the pcie_ep /
+     io_cpu / io_noc / io_ucie PHY / conn nodes plus their internal
+     bidirectional edges.
+   - **Cube instantiation** (`_instantiate_cube`): based on the router grid
+     in cube_mesh.yaml, instantiates the per-cube routers, PE sub-components
+     (pe_cpu, pe_dma, pe_fetch_store, pe_gemm, pe_math, pe_mmu, pe_tcm,
+     pe_scheduler, pe_ipcq), m_cpu, sram, and hbm_ctrl, and lays down the
+     internal edges.
+   - **Inter-cube edges** (`_add_inter_cube_edges`): UCIe N/S/E/W mesh
+     edges.
+   - **IO ↔ cube edges** (`_add_io_to_cube_edges`): connects io_noc to the
+     edge UCIe PHYs of the cubes.
+3. **Switch ↔ IO edges** (`_add_system_to_io_edges`): bidirectional edges
+   between `fabric.switch0` and each SIP's `pcie_ep` (ADR-0038 D3 + the
+   cross-SIP IPCQ path of ADR-0010).
+4. **Build the 4 view projections**:
+   - `_build_system_view(spec)`: Tray level, the SIPs and the system switch.
+   - `_build_sip_view(spec)`: the cube mesh + IO chiplet inside a SIP.
+   - `_build_cube_view(spec)`: the router grid + PE/M_CPU/SRAM/HBM_CTRL
+     attachments inside a single cube.
+   - `_build_pe_view(spec)`: the 9 sub-components + internal edges inside a
+     single PE.
+5. **Return the TopologyGraph**: `TopologyGraph(spec, nodes, edges,
+   system_view, sip_view, cube_view, pe_view)`.
+
+In these 6 steps **the order is significant**: inter-cube edges have valid
+src/dst only after the cubes exist, and the IO chiplets must be created
+first for the IO ↔ cube edges to reference them. When inserting a new node
+kind, place it according to these dependencies.
+
+### D2. `cube_mesh.yaml`: separate file + source_hash cache
+
+`mesh_gen.ensure_mesh_file(cube_spec, mesh_path)`:
+
+1. Computes `source_hash = _compute_source_hash(cube_spec)`. Input fields:
+   - `geometry` (cube_mm.w/h, etc.).
+   - `pe_layout` (corners, pe_per_corner).
+   - `ucie.n_connections`.
+   - `memory_map.hbm_mapping_mode`.
+   - `placement` (m_cpu/sram pos_mm).
+2. If `mesh_path` (= `cube_mesh.yaml` in the same directory as
+   `topology.yaml`) exists and `existing.source_hash == source_hash`, it is
+   reused (cache hit).
+3. Otherwise a new mesh is generated with
+   `_generate_mesh(cube_spec, source_hash)` and saved as yaml.
+
+Reasons for caching in a separate file:
+
+- Mesh generation involves PE/UCIe/router attachment computation and is too
+  heavy to redo every time.
+- Repeated runs with the same cube spec must be guaranteed the same mesh.
+- It gives people an artifact they can inspect / debug directly.
+
+The 5 fields listed for `source_hash` are the ones that determine the mesh
+shape; changes to anything else (e.g. bandwidth, overhead_ns) do not trigger
+mesh regeneration.
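
One way to picture the invalidation rule is a hash over only the five listed fields. The sketch below is an assumption, not the real `_compute_source_hash`: SHA-256 over canonical JSON, the name `compute_source_hash`, and every spec value are invented for illustration.

```python
import copy
import hashlib
import json
from typing import Any

# The five inputs named in D2, as key paths into the cube spec.
HASHED_FIELDS = (
    ("geometry",),
    ("pe_layout",),
    ("ucie", "n_connections"),
    ("memory_map", "hbm_mapping_mode"),
    ("placement",),
)


def _dig(spec: dict[str, Any], path: tuple[str, ...]) -> Any:
    """Follow a key path; missing keys hash as None."""
    node: Any = spec
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def compute_source_hash(cube_spec: dict[str, Any]) -> str:
    subset = {".".join(p): _dig(cube_spec, p) for p in HASHED_FIELDS}
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative spec values only.
base = {
    "geometry": {"cube_mm": {"w": 12.0, "h": 12.0}},
    "pe_layout": {"corners": ["NW", "NE", "SW", "SE"], "pe_per_corner": 2},
    "ucie": {"n_connections": 4, "bw_gbs": 64},
    "memory_map": {"hbm_mapping_mode": "interleaved", "capacity_gb": 16},
    "placement": {"m_cpu": {"pos_mm": [1.5, 5.5]},
                  "sram": {"pos_mm": [1.5, 8.5]}},
}

bw_changed = copy.deepcopy(base)
bw_changed["ucie"]["bw_gbs"] = 128                    # not a hashed field

moved = copy.deepcopy(base)
moved["placement"]["sram"]["pos_mm"] = [10.5, 8.5]    # hashed field

same_after_bw_change = compute_source_hash(base) == compute_source_hash(bw_changed)
same_after_move = compute_source_hash(base) == compute_source_hash(moved)
```

Changing only a bandwidth leaves the hash untouched (cache hit), while moving the SRAM placement changes it and would force regeneration.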
+
+### D3. Cube NoC mesh auto-layout algorithm
+
+`_generate_mesh(cube_spec, source_hash)`:
+
+#### D3.1. Determining rows / columns
+
+- `pe_positions = _corner_pe_positions(cube_w, cube_h)`: PE center
+  coordinates (mm) for each of the 4 corners (NW/NE/SW/SE). Hardcoded
+  `(1.5, 1.5)` / `(cube_w-1.5, cube_h-1.5)` pattern; with `pe_per_corner=2`
+  each corner has 2 PE positions.
+- `col_xs = _compute_col_positions(...)`: the union of the PE x coordinates,
+  with relay columns inserted into gaps larger than `max_spacing = 3.0 mm`.
+- `row_ys, rows_per_half = _compute_row_positions(cube_h, n_connections,
+  pe_positions)`:
+  - `n_conn = max(n_connections, 2)` (hot path minimum).
+  - `rows_per_half = ceil(n_conn / 2)`.
+  - Top half + two HBM rows + bottom half. The HBM rows sit at
+    `(cube_h/2 - 1.5, cube_h/2 + 1.5)`, with `hbm_gap = 1.5 mm` between the
+    PE rows and the HBM rows.
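
The two arithmetic rules in D3.1 can be sketched as below. `rows_per_half` follows the formula given above; `col_positions` is a hypothetical reading of "insert relay columns into gaps larger than `max_spacing`" that spaces relays evenly, which the real `_compute_col_positions` may or may not do. It assumes at least one PE x coordinate.

```python
import math

MAX_SPACING_MM = 3.0


def rows_per_half(n_connections: int) -> int:
    # n_conn = max(n_connections, 2); rows_per_half = ceil(n_conn / 2)
    return math.ceil(max(n_connections, 2) / 2)


def col_positions(pe_xs: list[float],
                  max_spacing: float = MAX_SPACING_MM) -> list[float]:
    """Union of PE x coordinates plus evenly spaced relay columns.

    Assumption: a gap wider than max_spacing is split into the smallest
    number of equal segments that are each no wider than max_spacing.
    """
    xs = sorted(set(pe_xs))
    cols = [xs[0]]
    for left, right in zip(xs, xs[1:]):
        gap = right - left
        n_seg = max(1, math.ceil(gap / max_spacing))
        for i in range(1, n_seg):
            cols.append(left + gap * i / n_seg)   # relay column
        cols.append(right)
    return cols


wide = col_positions([1.5, 10.5, 1.5])   # duplicate x collapses via the union
tight = col_positions([1.5, 4.5])        # gap == max_spacing: no relay needed
```

For PE columns at 1.5 mm and 10.5 mm the 9 mm gap is split into three 3 mm segments, which adds two relay columns.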
+
+#### D3.2. HBM exclusion zone
+
+`hbm_row_start = rows_per_half`, `hbm_row_end = rows_per_half + 1`.
+`hbm_col_start = n_cols // 2 - 1`, `hbm_col_end = n_cols // 2`.
+
+Router slots inside this (row, col) rectangle are marked `None` (no router).
+The actual HBM controllers are attached as separate `hbm_ctrl.pe{X}` nodes,
+following the per-PE partition pattern of ADR-0017 D9.
+
+#### D3.3. PE attachment
+
+The PEs of each corner are mapped to the following rows:
+
+- Top half: NW → row 0, NE → row 1 (index within top_corners).
+- Bottom half: SW → row `hbm_row_end + 1`, SE → row `hbm_row_end + 2`.
+
+Each PE attaches to the router in the column closest to its x coordinate
+(`min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))`). Three
+attachment entries are pushed onto the per-router attach list:
+`pe{pe_idx}.dma`, `pe{pe_idx}.cpu`, and `pe{pe_idx}.hbm`.
+
+#### D3.4. M_CPU / SRAM attachment: nearest router by Euclidean distance
+
+From the coordinates `placement.m_cpu.pos_mm` (default `[1.5, 5.5]`) and
+`placement.sram.pos_mm` (default `[1.5, 8.5]`), the nearest router by
+Euclidean distance is found and `"m_cpu"` / `"sram"` is added to its attach
+list.
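
A small sketch of the two nearest-neighbour rules (D3.3 nearest column, D3.4 nearest router). The 4×4 grid, the 12 mm cube, and the function names are illustrative assumptions; skipping `None` slots (the HBM exclusion zone of D3.2) when searching is also an assumption about the real code.

```python
import math
from typing import Optional

Coord = tuple[float, float]


def nearest_col(col_xs: list[float], pe_x: float) -> int:
    """Column whose x coordinate is closest to the PE (D3.3)."""
    return min(range(len(col_xs)), key=lambda c: abs(col_xs[c] - pe_x))


def nearest_router(grid: dict[tuple[int, int], Optional[Coord]],
                   pos_mm: Coord) -> tuple[int, int]:
    """(row, col) of the closest existing router by Euclidean distance (D3.4)."""
    px, py = pos_mm
    live = {rc: xy for rc, xy in grid.items() if xy is not None}
    return min(live, key=lambda rc: math.hypot(live[rc][0] - px,
                                               live[rc][1] - py))


# Illustrative 12 mm cube: PE rows at 1.5 / 10.5, HBM rows at 4.5 / 7.5.
col_xs = [1.5, 4.5, 7.5, 10.5]
row_ys = [1.5, 4.5, 7.5, 10.5]
hbm_zone = {(1, 1), (1, 2), (2, 1), (2, 2)}    # slots marked None (D3.2)
grid = {
    (r, c): None if (r, c) in hbm_zone else (col_xs[c], row_ys[r])
    for r in range(len(row_ys))
    for c in range(len(col_xs))
}

pe_col = nearest_col(col_xs, 9.5)
m_cpu_router = nearest_router(grid, (1.5, 5.5))   # default m_cpu pos_mm
sram_router = nearest_router(grid, (1.5, 8.5))    # default sram pos_mm
```

With the default placements, both components land on the leftmost column, on the two rows that flank the cube's vertical center.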
+
+#### D3.5. UCIe N/S/E/W distribution
+
+`ucie_pe_rows = top_pe_rows + bot_pe_rows` (`2 * rows_per_half` in total).
+
+- UCIe-E: for every PE row, `ucie_e.c{i}` on the router in the rightmost
+  column.
+- UCIe-W: `ucie_w.c{i}` on the router in the leftmost column (mirror of E).
+- UCIe-N/S: the PE columns are split half left and half right and attached
+  to the corresponding columns of the top row / bottom row.
+
+Each UCIe connection carries a `c{i}` index, so as many PHYs as
+ucie_n_connections are distributed (ADR-0017 D5+).
+
+### D4. Node naming rules: single owner
+
+builder.py creates nodes using the following naming rules (the single-owner
+principle of ADR-0051 D5):
+
+- `fabric.switch0`: system-level switch.
+- `sip{S}.{io_id}.{pcie_ep|io_cpu|io_noc|io_ucie.{dir}|conn.{id}}`: IO
+  chiplet.
+- `sip{S}.cube{C}.{m_cpu|sram|hbm_ctrl.pe{X}|noc.r{R}c{C}|...}`: inside a cube.
+- `sip{S}.cube{C}.pe{P}.{pe_cpu|pe_dma|pe_fetch_store|pe_gemm|pe_math|pe_mmu|pe_tcm|pe_scheduler|pe_ipcq}`: PE sub-components.
+
+Changing these naming rules requires updating both builder.py and the
+helpers in router.py (ADR-0051) together. Components do not know the naming
+rules directly; they only call the helpers.
+
+### D5. Edge `kind` classification
+
+The `kind` assigned to each edge is the input to the routing policy
+(ADR-0051 D2). Main kind values:
+
+- `"pe_internal"`: between sub-components inside a PE.
+- `"pe_to_router"`: PE_DMA ↔ cube NoC router.
+- `"router_mesh"`: between cube NoC routers.
+- `"router_to_hbm"`, `"router_to_mcpu"`, `"router_to_sram"`,
+  `"sram_to_router"`, etc.: for cube-attached components.
+- `"ucie_internal"`, `"ucie_conn_to_router"`, `"router_to_ucie_conn"`,
+  `"ucie_conn_to_noc"`, `"noc_to_ucie_conn"`, `"ucie_mesh"`: UCIe related.
+- `"io_internal"`: inside an IO chiplet.
+- `"io_to_cube"`, `"cube_to_io"`: IO ↔ cube boundary.
+- `"pcie"`: switch ↔ pcie_ep.
+- `"command"`: control-plane only edges (M_CPU ↔ NOC, etc.; excluded from
+  the PE DMA path).
+
+When adding a new edge kind, decide which category of router.py's 4
+adjacency graphs (ADR-0051 D2) it belongs to; otherwise it is included only
+in `_adj_all` by default and may be routed differently than intended.
+
+### D6. View projection: 4 abstraction levels
+
+Besides the flat representation (nodes + edges), `TopologyGraph` holds 4
+view projections:
+
+- **system_view** (`_build_system_view`): Tray level. SIP boxes +
+  `fabric.switch0`, with PCIE links shown. High-level overview for external
+  presentations.
+- **sip_view** (`_build_sip_view`): inside one SIP. Cube mesh + IO chiplet
+  (pcie_ep + io_cpu + io_noc). UCIe N/S/E/W appear as links between cubes.
+- **cube_view** (`_build_cube_view`): inside one cube. Router grid +
+  PE/M_CPU/SRAM/HBM_CTRL attachments + the UCIe PHY edge portion. For
+  diagnosing routing / placement inside a cube.
+- **pe_view** (`_build_pe_view`): inside one PE. 9 sub-components + internal
+  edges (pe_internal kind). For reviewing the detailed dataflow inside a PE.
+
+Views are emitted selectively via the spec, e.g.
+`visualization.emit_views: [system, sip, cube]` (ADR-0006). The pe view is
+left out of the default output, but the code is kept (for detailed
+debugging).
+
+### D7. visualizer.py: SVG diagram output
+
+`emit_diagrams(graph, out_dir)` renders every view to SVG. Key functions:
+
+- `_render_view_svg(view)`: generic view rendering (when there is no router
+  grid).
+- `_render_cube_view_svg(view, spec)`: cube view only (drawing the HBM
+  block, router grid layout, PE/M_CPU/SRAM/HBM positioning).
+- `_draw_node`, `_draw_edge`: visual representation of nodes / edges.
+- `_pick_scale`, `_compute_node_sizes`: automatic scaling.
+
+The visualizer is classified as a **derived artifact** (ADR-0006) and is not
+subject to the production check when its code changes. This is consistent
+with the "Derived Artifacts" item in CLAUDE.md.
+
+### D8. Impact scope of spec changes
+
+| spec field                             | Affects                 | Mesh regen |
+|----------------------------------------|-------------------------|------------|
+| `system.sips.count`                    | SIP count, node count   | No         |
+| `sip.cube_mesh.w/h`                    | cube mesh shape         | No         |
+| `cube.geometry.cube_mm.w/h`            | cube size (mm)          | **Yes**    |
+| `cube.pe_layout.corners/pe_per_corner` | PE attachment positions | **Yes**    |
+| `cube.ucie.n_connections`              | UCIe PHY distribution   | **Yes**    |
+| `cube.memory_map.hbm_mapping_mode`     | HBM distribution mode   | **Yes**    |
+| `cube.placement`                       | M_CPU/SRAM position     | **Yes**    |
+| `cube.memory_map.*` (except above)     | HBM capacity / BW       | No         |
+| `*.links.*.bw_gbs`                     | edge bandwidth          | No         |
+| `*.attrs.overhead_ns`                  | component latency       | No         |
+
+The table matches the `_compute_source_hash` inputs of D2. Changes that
+require mesh regeneration automatically invalidate the source_hash in
+`cube_mesh.yaml`.
+
+## Alternatives Considered
+
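+### A1. Regenerate the mesh on every compile, without a separate cache file
+
+Rejected. Every case that calls with the same spec repeatedly (CLI run,
+probe, test) would pay the mesh generation cost again. The artifact that
+lets people inspect the mesh would also disappear.
+
+### A2. Merge mesh generation into builder.py
+
+Rejected (for now). It is a self-contained 305-line algorithm, and the mesh
+layout decisions (placement-driven router attachment, HBM exclusion zone)
+differ from the builder's general responsibility of creating nodes/edges.
+Keeping them separate better fits the single responsibility principle.
+
+### A3. Express placement coordinates in cube coordinates (col/row)
+
+Rejected. mm coordinates are used consistently on both the visualization
+side (visualizer) and the mesh layout side (nearest-router computation).
+Cube coordinates are undefined until the router grid has been determined,
+so they are unsuitable as placement input.
+
+### A4. Build view projections lazily
+
+Rejected (for now). Building the 4 views is cheap (typically < 100 ms), and
+eager construction guarantees a single source of truth through
+`TopologyGraph`.
+
+### A5. Support visualizer output formats other than SVG (PNG/PDF)
+
+Rejected. SVG is the most flexible format: vector, text-searchable, and
+directly renderable in a browser. If PNG conversion is needed, post-process
+with a separate tool (rsvg-convert, etc.).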
+
+## Consequences
+
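+- The high-level intent of ADR-0006 is made concrete by D1–D7, and the
+  impact of a topology change can be gauged quickly from the D8 table.
+- The mesh auto-layout algorithm of D3 is fixed at the ADR level, so it is
+  clear which step is affected when a new PE attachment pattern (e.g. a
+  6-zone HBM split) is introduced later.
+- The edge kind list of D5 and the view structure of D6 are explicit, so a
+  PR reviewer can see at a glance how far (builder + router + visualizer) a
+  new component kind reaches.
+- The source_hash invalidation rule of D2 is explicit, so it is clear that
+  cube_mesh.yaml not being regenerated (e.g. when only a bw value changed)
+  is normal behavior.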
@@ -0,0 +1,138 @@
+# ADR-0054: Milestone evaluation bench: self-contained sweep + figure bench
+
+## Status
+
+Accepted (2026-05-22).
+
+Amends ADR-0044 (D1/D2) and ADR-0045 (D5), and supersedes the
+ADR-0043/0044 layout in which "the logic lives in `scripts/` + `tests/`":
+the GEMM/allreduce evaluation harnesses now become self-contained
+**benches** that the user runs to regenerate all results + figures.
+
+## Context
+
+ADR-0043 (allreduce evaluation) and ADR-0044 (GEMM evaluation) split each
+harness into a **sweep** (a manual `scripts/` driver or, for allreduce, the
+parametrized tests themselves) plus **figure tests** that render committed
+data. As a result the sweep/render logic lived in `scripts/gemm_sweep.py`,
+`tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.
+
+The milestone requirement ("refactor the allreduce + GEMM evaluation so
+that the user runs *one bench* to generate all results and plots") cannot
+be met with that layout: a bench is production code and **cannot import
+`tests/`** (ADR-0007 layer direction). The evaluation logic had to move into
+production so that the bench can reach it.
+
+The chosen home is the bench module itself, not a separate `kernbench.eval`
+package. A bench file may contain arbitrary module-level code, and folding
+the harness into the bench keeps one file per domain and removes one
+package layer.
+
+## Decision
+
+### D1. Two milestone benches own the evaluation logic
+
+- `src/kernbench/benches/milestone_1h_gemm.py`: GEMM shape×variant sweep +
+  three figure renderers (moved from `scripts/gemm_sweep.py` +
+  `tests/gemm/_gemm_plot_helpers.py`).
+- `src/kernbench/benches/milestone_1h_ccl.py`: distributed allreduce driver,
+  latency + buffer-kind sweep, topology diagram, FSIM comparison, and the
+  direct-launch parity reference (moved from
+  `tests/sccl/_allreduce_helpers.py`).
+
+Each file is the **single home** of its domain's evaluation logic.
+
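+### D2. The "evaluation bench" pattern (extends ADR-0045 D5)
+
+ADR-0045 D5 pinned a bench to a single configuration (single-SIP, or the
+ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:
+
+- An **evaluation bench** may drive *multiple* configurations and render
+  figures. Instead of using the outer `run_bench` engine, it builds its own
+  `GraphEngine` / `RuntimeContext` for each sweep point.
+- No handle is then submitted to the outer ctx, so the bench submits a
+  **sentinel tensor** (`torch.zeros((1, 1), …)`) at the end to satisfy the
+  "at least one submission" contract of `run_bench` (ADR-0045 D4) and let
+  the CLI exit with 0.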
+
+### D3. Output location
+
+Both benches write to
+`src/kernbench/benches/1H_milestone_output/{gemm,ccl}/` (user request:
+artifacts next to the bench). The directory holds only generated
+PNG/CSV/JSON (no `.py`/`__init__.py`), so the eager-import audit (ADR-0045
+first action) ignores it: `pkgutil.iter_modules` does not yield non-package
+subdirectories. Like the `docs/diagrams/` artifacts it is **committed** (so
+the figures can be viewed on the remote); re-running the bench regenerates
+it in place.
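
The `pkgutil.iter_modules` behavior that D3 relies on can be checked in isolation. The directory layout below is a throwaway imitation (the `helpers` package is hypothetical and only serves as a contrast case); it is not the real `benches/` tree.

```python
import pkgutil
import tempfile
from pathlib import Path


def visible_modules(package_dir: Path) -> list[str]:
    """Names that pkgutil.iter_modules reports for one directory."""
    return sorted(info.name for info in pkgutil.iter_modules([str(package_dir)]))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # A flat bench module and a real subpackage (has __init__.py).
    (root / "milestone_1h_gemm.py").write_text("")
    (root / "helpers").mkdir()
    (root / "helpers" / "__init__.py").write_text("")

    # Artifact directory: data files only, no __init__.py anywhere.
    out = root / "1H_milestone_output" / "gemm"
    out.mkdir(parents=True)
    (out / "sweep.json").write_text("{}")

    found = visible_modules(root)
```

Only the module and the real package are reported; the artifact directory is skipped because it contains no `__init__.py`.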
+
+### D4. Heavy GEMM sweep: fresh by default, reuse with `MILESTONE_FAST`
+
+`milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one
+shape has 2048 tiles). `MILESTONE_FAST=1` reuses the committed
+`docs/diagrams/gemm_sweep.json` and only renders (seconds). This reverses
+"keep the heavy sweep as a manual/`slow` step" from ADR-0044 D1/D2: running
+the bench is the regeneration. The slow path is exercised by a
+`@pytest.mark.slow` bench test, while the fast path is what runs by default
+in the test suite.
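
The mode switch of D4 amounts to a single environment check. The sketch below is hypothetical: `load_or_run_sweep`, the injected `environ` mapping, and writing the fresh results back to JSON are assumptions made for illustration, not the bench's actual code.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Mapping, Optional


def load_or_run_sweep(
    sweep_json: Path,
    run_sweep: Callable[[], list],
    environ: Optional[Mapping[str, str]] = None,
) -> list:
    """Fresh sweep by default; reuse the stored JSON when MILESTONE_FAST=1."""
    env = os.environ if environ is None else environ
    if env.get("MILESTONE_FAST") == "1":
        return json.loads(sweep_json.read_text())
    results = run_sweep()
    # Assumption: a fresh run refreshes the stored sweep file.
    sweep_json.write_text(json.dumps(results))
    return results


calls: list[str] = []


def fake_sweep() -> list:
    # Stand-in for the multi-minute simulation sweep.
    calls.append("sweep")
    return [{"shape": [128, 128, 128], "latency_ns": 1.0}]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "gemm_sweep.json"
    fresh = load_or_run_sweep(path, fake_sweep, environ={})
    fast = load_or_run_sweep(path, fake_sweep, environ={"MILESTONE_FAST": "1"})
```

Passing `environ` explicitly keeps the example independent of the caller's real environment; the second call returns the stored data without invoking the sweep again.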
+
+### D5. Tests + scripts reuse the bench via thin re-export shims (single home preserved)
+
+The existing figure tests and the `scripts/gemm_sweep.py` entry point are
+kept and now reuse the bench modules:
+
+- `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
+  `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
+  `kernbench.benches.milestone_1h_gemm`.
+- `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
+  writer, sweep constants, renderers, and disk aggregator from
+  `kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
+  local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
+  `_bk_params`) and the fixture-bound `_run_distributed` wrapper
+  (`monkeypatch.chdir` + `_drive_distributed`).
+- `scripts/gemm_sweep.py` → thin wrapper over the bench's `run_sweep`.
+
+Tests importing a bench module is allowed (tests sit above production,
+ADR-0007). It triggers the eager audit of the whole package, which already
+happens on every `kernbench` run. matplotlib stays a lazy import inside the
+renderers, so the startup cost of the audit is unchanged.
+
+### D6. Flat module naming (no subfolder under `benches/`)
+
+A `benches/` subpackage named `1H_milestone…` is not possible: a Python
+package name cannot start with a digit. The benches are therefore flat
+modules `milestone_1h_gemm.py` / `milestone_1h_ccl.py`, and the bench names
+are `milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, starting with a
+letter per ADR-0045 D1).
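
The naming constraint can be checked with the standard library: a name is usable in an `import` statement only if it is a valid identifier and not a keyword. `is_valid_package_name` is a hypothetical helper written for this illustration.

```python
import keyword


def is_valid_package_name(name: str) -> bool:
    """True if `name` can appear in an `import` statement."""
    return name.isidentifier() and not keyword.iskeyword(name)


checks = {
    name: is_valid_package_name(name)
    for name in ("1H_milestone_output", "milestone_1h_gemm", "milestone_1h_ccl")
}
```

The output directory name fails the check, which is harmless because it is never imported, while both flat module names pass.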
+
+## Consequences
+
+### Positive
+
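+- `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of
+  a domain's results + figures with one command, which is the milestone
+  requirement.
+- A single source for the evaluation logic (the bench), reused by tests and
+  scripts through shims; no duplication.
+- The figure tests and `scripts/gemm_sweep.py` keep working unchanged.
+
+### Negative / limitations
+
+- The two bench files are large (the CCL one mixes the distributed driver,
+  sweeps, and matplotlib drawing). A "bench" that is mostly an evaluation
+  harness is unusual, and this ADR justifies it.
+- Generated artifacts live inside the source tree
+  (`src/kernbench/benches/`) by explicit request and are committed (so the
+  figures can be viewed on the remote); re-running the bench regenerates
+  them.
+- `milestone-1h-ccl` (and the default `milestone-1h-gemm`) takes minutes,
+  which is acceptable for on-demand milestone artifacts but not for
+  everyday runs.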
+
+## Dependencies
+
+- **ADR-0007**: layer direction (why tests may import production but a
+  bench may not import tests).
+- **ADR-0043 / ADR-0044**: the allreduce / GEMM evaluation harnesses that
+  this ADR moves into benches.
+- **ADR-0045**: bench module contract; D2 here extends its D5
+  (single-device rule) with the evaluation-bench pattern and relies on its
+  D4 (NO_REQUESTS) for the sentinel.
+- **ADR-0024**: the rank = SIP launcher that the allreduce sweep drives.
+
+## Open questions
+
+- Should the GEMM theoretical model constants (ADR-0044 D5) be sourced from
+  ADR-0033/0014 instead of being copied? Unchanged by this ADR.
+- Should `build_overview_slides.py` consume the milestone output PNGs
+  instead of drawing the GEMM bars natively? Still open (ADR-0044 D6 /
+  Negative).
@@ -0,0 +1,175 @@
+# ADR Index
+
+Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **47**.
+
+Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.
+
+## Design Principles
+
+- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — 검증 전략 및 Phase 1 테스트 계획
+- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — 레이턴시 모델: 가정 및 알려진 단순화
+
+## High-level Architecture
+
+- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — 타겟 시스템 계층 및 모델링 범위  _(System hierarchy (Tray / SIP / CUBE / PE))_
+- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — 런타임 API 및 시뮬레이션 엔진 경계  _(Runtime API ↔ sim_engine boundaries)_
+- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NoC와 메모리 데이터 경로  _(IOChiplet NOC and memory data path)_
+- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — 큐브 NoC와 HBM 연결성  _(Cube NOC and HBM connectivity)_
+
+## Detailed Architecture
+
+One subsection per component file under `src/kernbench/components/builtin/`.
+
+### forwarding
+
+- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding 컴포넌트 (forwarding_v1)
+
+### hbm_ctrl
+
+- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM 컨트롤러 내부 설계
+
+### io_cpu
+
+- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU 컴포넌트 모델
+
+### m_cpu
+
+- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU 및 M_CPU.DMA 컴포넌트 모델
+
+### pcie_ep
+
+- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model
+
+### pe_cpu
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
+
+### pe_dma
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
+- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
+
+### pe_fetch_store
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
+
+### pe_gemm
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
+
+### pe_ipcq
+
+- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
+
+### pe_math
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
+
+### pe_mmu
+
+- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — 컴포넌트 + 유틸리티 이중 역할
+
+### pe_scheduler
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE 파이프라인 실행 모델
+
+### pe_tcm
+
+- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — 듀얼 채널 BW 직렬화
+
+### sram
+
+- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC
+
+### tiling
+
+- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math 파이프라인 plan 빌더
+
+## Implementation Decisions
+
+### Address Scheme
+
+- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51비트 물리 주소 레이아웃 및 디코딩 계약
+- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — 메모리 주소 지정 — PA / VA / LA 주소 모델
+
+### Routing & Helper API
+
+- [ADR-0002](./ADR-0002-lat-routing-distance.md) — 라우팅 거리, 순서 및 우회 규칙
+- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`
+
+### Memory Semantics & Local-HBM Bandwidth
+
+- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — 메모리 시맨틱 및 로컬 HBM 대역폭 보장
+
+### Topology Compilation, Diagrams & Builder Algorithms
+
+- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — 다이어그램 뷰 및 거리 기반 레이아웃 규칙
+- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — 토폴로지 컴파일, 거리 추출, 그리고 자동 다이어그램 생성
+- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms
+
+### Tensor Deployment and Allocation
+
+- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — 텐서 배포 및 할당 (호스트 할당기, PA 우선)
+
+### Kernel Execution and Host-Device Messaging
+
+- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — 커널 실행 메시징 및 완료 시맨틱
+- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU 메시지 스키마 (PA-우선, PE-태깅)
+
+### CLI Surface and Semantics
+
+- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — 명령줄 인터페이스 및 실행 시맨틱
+
+### Component Port/Wire Fabric Model
+
+- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — 컴포넌트 포트/와이어 모델과 패브릭 라우팅
+
+### Two-Pass Data Execution
+
+- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass 데이터 실행 모델 (타이밍 / 데이터 분리)
+
+### 2D Grid Program Identity
+
+- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D 그리드 program_id 시맨틱
+
+### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
+
+- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
+- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — sip/num_sips 필드 제거
+- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
+- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
+- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`
+
+### IPCQ Direction Addressing
+
+- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching
+
+### Intercube All-Reduce
+
+- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — 큐브 간 All-Reduce — pe0 큐브-메시 리듀스 + 다중-SIP 교환
+
+### Evaluation Harnesses
+
+- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce 평가 하니스 — `tests/sccl/`
+- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM 평가 하니스 — `scripts/gemm_sweep.py` + `tests/gemm/`
+- [ADR-0054](./ADR-0054-eval-milestone-benches.md) — 마일스톤 평가 bench — 자기완결적 sweep + figure bench
+
+### Bench Module Contract
+
+- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring
+
+### Kernel-side tl.* API (TLContext)
+
+- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract
+
+### Memory Allocator Algorithms
+
+- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
+
+### Probe Subcommand
+
+- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
+
+### Sim-engine Op Log and Memory Store Schemas
+
+- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -7,9 +7,10 @@ Accepted
 ## Context

 The `kernbench` CLI is the user-facing entry point of the simulator. It
-exposes three subcommands:
+exposes four subcommands:

 - `run` — execute a benchmark against a topology.
+- `list` — enumerate registered benches.
 - `probe` — diagnostic utility for latency / BW measurement.
 - `web` — interactive topology viewer.

@@ -33,8 +34,10 @@ Required arguments:

 - `--topology <path>`: topology YAML file path. Loaded via
  `resolve_topology()`.
- `--bench <name>`: benchmark name. Resolved via
-  `benches.loader.resolve_bench()`.
+- `--bench <identifier>`: benchmark identifier. Resolved via
+  `kernbench.benches.registry.resolve()`, which accepts either the
+  registered kebab-case name (e.g., `gemm-single-pe`) or a numeric
+  index from `kernbench list`.

 Optional arguments:

@@ -63,7 +66,21 @@ When `--device all` (or omitted) and the topology has multiple SIPs:
 The CLI does NOT spawn multiple OS processes or independent
 simulation runs — parallelism is internal to one simulation instance.

-### D4. `kernbench probe` — latency / BW diagnostic utility
+### D4. `kernbench list` — enumerate registered benches
+
+No arguments. Prints each registered bench's auto-assigned index,
+registered name, and one-line description.
+
+Benches register themselves via the `@bench(name=..., description=...)`
+decorator (`kernbench.benches.registry`). Every non-underscore module
+under `kernbench.benches/` MUST register at least one bench; a missing
+decorator raises `RuntimeError` at package import time.
+
+Indices are assigned alphabetically by name at import time. They are a
+CLI convenience (shorthand for `--bench`), not a stable API — a new
+bench inserted alphabetically will shift later indices.
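
The registration and lookup rules above can be sketched as a small in-memory registry. This is a minimal illustration, not the code in `kernbench.benches.registry`: the `BenchEntry` type, the `_REGISTRY` dict, the zero-based index, and the `ccl-allreduce` bench name are all assumptions made for the example.

```python
from typing import Callable, Dict, List, NamedTuple


class BenchEntry(NamedTuple):
    name: str
    description: str
    fn: Callable[[], object]


# Hypothetical in-memory registry (assumption: the real module keeps similar state).
_REGISTRY: Dict[str, BenchEntry] = {}


def bench(name: str, description: str):
    """Register the decorated callable under a kebab-case name."""
    def decorator(fn):
        if name in _REGISTRY:
            raise RuntimeError(f"duplicate bench name: {name}")
        _REGISTRY[name] = BenchEntry(name, description, fn)
        return fn
    return decorator


def list_benches() -> List[BenchEntry]:
    """Entries sorted by name; the list position is the CLI index."""
    return [_REGISTRY[n] for n in sorted(_REGISTRY)]


def resolve(identifier: str) -> BenchEntry:
    """Accept a registered name or a numeric index (zero-based here, by assumption)."""
    if identifier.isdigit():
        entries = list_benches()
        index = int(identifier)
        if index >= len(entries):
            raise KeyError(f"no bench at index {index}")
        return entries[index]
    if identifier not in _REGISTRY:
        raise KeyError(f"unknown bench: {identifier}")
    return _REGISTRY[identifier]


@bench(name="gemm-single-pe", description="single-PE GEMM")
def _gemm():
    return "gemm"


@bench(name="ccl-allreduce", description="illustrative second bench")
def _allreduce():
    return "allreduce"
```

Because the index is just a position in the sorted name list, registering a bench whose name sorts earlier moves every later bench down by one, which is why the index is a convenience rather than a stable identifier.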
+
+### D5. `kernbench probe` — latency / BW diagnostic utility

 Required argument:

@@ -87,7 +104,7 @@ that local-HBM access ≤ cross-PE-within-cube ≤ cross-cube ≤
 cross-SIP — and reports violations. Probe is a developer tool for
 verifying the latency / BW model; it is not a benchmark.

-### D5. `kernbench web` — topology viewer
+### D6. `kernbench web` — topology viewer

 Optional arguments:

@@ -102,7 +119,7 @@ the browser. Distinct from the static `docs/diagrams/` artifacts:
 - `kernbench web` is interactive — pan/zoom, hover for component
  attributes, switch between SIP / CUBE / PE views.

-### D6. Runtime API and simulation engine remain device-scoped
+### D7. Runtime API and simulation engine remain device-scoped

 - Runtime API calls operate on one device per invocation.
 - The simulation engine schedules all requests deterministically.
@@ -112,6 +129,10 @@ This invariant keeps each layer testable in isolation; device
 enumeration and multi-device fan-out live only in the CLI's `run`
 command (D3).

+The `probe` implementation lives under `kernbench.probes` (separate
+from `kernbench.benches`), reflecting that probes are diagnostic
+utilities, not registered benches.
+
 ## Consequences

 - Benchmark authors write single-device logic; multi-device behavior
@@ -173,6 +173,37 @@ placement = resolve_dp_policy(
 No post-hoc `pe_index` shifting — ShardSpec carries the `(sip, cube, pe)`
 structural coordinates directly. ShardSpec details in ADR-0026.

+### D5. SIP grid dimensions — explicit `sips.w/h` resolution
+
+For 2D inter-SIP topologies (`torus_2d`, `mesh_2d_no_wrap`) the SIP grid
+shape (width × height) is resolved from `system.sips.w` / `system.sips.h`,
+mirroring how D1 resolves `world_size` from `sips.count`. Precedence:
+explicit `w/h` (validated `w*h == count`) > square fallback
+(`round(sqrt(count))²`, used only when no `w/h` is given) > error.
+
+```python
+sips = spec.get("system", {}).get("sips", {})
+if sip_topo == "ring_1d":
+    w, h = 0, 0                          # 1D sentinel (no grid)
+elif sips.get("w") is not None and sips.get("h") is not None:
+    w, h = int(sips["w"]), int(sips["h"])
+    if w * h != n_sips:
+        raise ValueError(f"sip layout {w}x{h} != sips.count ({n_sips})")
+else:
+    side = int(round(math.sqrt(n_sips)))
+    if side * side != n_sips:
+        raise ValueError("non-square sips.count requires explicit sips.w/h")
+    w, h = side, side
+```
+
+This lifts the earlier assumption that 2D SIP grids must be perfect
+squares: a 6-SIP `torus_2d` / `mesh_2d_no_wrap` is now expressible as
+`w: 3, h: 2` (or `2x3`). The derived `(w, h)` feed the algorithm's
+inter-SIP exchange (consumed in ADR-0032 D5). The prior code path silently
+took `round(sqrt(count))²` for any non-ring topology, which produced a
+wrong grid (e.g. 2×2 for 6 SIPs); the explicit-`w/h` path with a
+fail-loud fallback replaces that.
+
 ---

 ## Dependencies
@@ -32,7 +32,7 @@ bandwidth characteristics for the common per-cube DP workload.

 ### Current state

- `src/kernbench/ccl/algorithms/intercube_allreduce.py` — kernel
+- `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` — kernel
 - `src/kernbench/ccl/sfr_config.py` — `configure_sfr_intercube_multisip`
 - `src/kernbench/runtime_api/distributed.py` — `AhbmCCLBackend` wires this
  automatically at `init_process_group` time.
@@ -43,29 +43,46 @@ bandwidth characteristics for the common per-cube DP workload.

 ## Decision

-### D1. Algorithm structure — 5 phases
+### D1. Algorithm structure — 5 phases (center-root, bidirectional)
+
+The root cube sits at the geometric **center** of the cube mesh:
+
+```
+root_col  = cube_w // 2
+root_row  = cube_h // 2
+root_cube = root_row * cube_w + root_col   # center; 10 on a 4×4 mesh
+```
+
+Each reduce phase converges on this center and each broadcast phase diverges
+from it **bidirectionally**, shortening the intra-SIP critical path versus a
+corner-root walk (4×4 mesh: 4 hops reduce + 4 hops broadcast vs 6+6 with an
+SE-corner root; the saving approaches half on larger meshes).
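
The hop counts quoted above can be checked with a short calculation. This sketch assumes one hop per adjacent-cube step and that the row phase completes before the column phase starts; the function names are illustrative and do not come from the kernel.

```python
def center_root(cube_w: int, cube_h: int) -> int:
    """Cube id of the geometric center, using the formula from D1."""
    root_col = cube_w // 2
    root_row = cube_h // 2
    return root_row * cube_w + root_col


def reduce_hops_center(cube_w: int, cube_h: int) -> int:
    """Critical-path hops for a bidirectional reduce toward the center root."""
    root_col, root_row = cube_w // 2, cube_h // 2
    # Each phase is bounded by the longer of its two converging halves.
    row_hops = max(root_col, cube_w - 1 - root_col)
    col_hops = max(root_row, cube_h - 1 - root_row)
    return row_hops + col_hops


def reduce_hops_corner(cube_w: int, cube_h: int) -> int:
    """Critical-path hops for a one-directional walk to the SE corner."""
    return (cube_w - 1) + (cube_h - 1)


if __name__ == "__main__":
    for w, h in [(1, 1), (4, 4), (8, 8)]:
        print(w, h, center_root(w, h), reduce_hops_center(w, h), reduce_hops_corner(w, h))
```

The broadcast phases mirror the reduce phases, so the same counts apply in the outward direction. On a 1×1 mesh both counts are zero, which corresponds to the single-cube fast-path.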

 For each SIP (launched concurrently by `mp.spawn`):

 ```
-Phase 1 — Row reduce W → E (cube mesh, pe0 only):
-    col=0 sends E → col=1 accumulates, sends E → ... → col=3 holds row sum.
+Phase 1 — Row reduce converging at col == root_col (cube mesh, pe0 only):
+    left half (col < root_col) walks W→E; right half (col > root_col)
+    walks E→W; the root_col cube merges both sides → holds row sum.

-Phase 2 — Col reduce N → S on rightmost column (pe0, col = mesh_w-1):
-    row=0 sends S → row=1 accumulates, sends S → ... → root cube (15)
-    holds the full SIP sum.
+Phase 2 — Col reduce on col == root_col converging at row == root_row:
+    above (row < root_row) walks N→S; below (row > root_row) walks S→N;
+    the root cube merges both → holds the full SIP sum.

-Phase 3 — Inter-SIP exchange on root cube (pe0 of root cube only):
+Phase 3 — Inter-SIP exchange on cube_id == root_cube (pe0 only):
    Ring / torus-2d row+col ring / mesh-2d chain reduce+broadcast —
    selected by sip_topo_kind (from topology.yaml sips.topology).

-Phase 4 — Col broadcast S → N on rightmost column.
+Phase 4 — Col broadcast on col == root_col, outward from root_row.

-Phase 5 — Row broadcast E → W across the cube mesh.
+Phase 5 — Row broadcast outward from root_col across the cube mesh.
 ```

 After all phases every cube's pe0 holds the global sum.

+**Single-cube fast-path**: when `cube_w == cube_h == 1` (one cube per rank,
+the common TP case), the intra-SIP reduce/broadcast phases are skipped and
+the kernel goes straight to the Phase 3 inter-SIP exchange.
+
 The kernel is a single function parameterised by `sip_topo_kind ∈ {0, 1, 2}`
 (ring_1d, torus_2d, mesh_2d_no_wrap). Phases 1-2 and 4-5 are identical
 across topologies; only phase 3 branches. Helper functions
@@ -121,20 +138,24 @@ system:
 ```

 - `ring_1d`: n_sips-1 rounds of `send global_E / recv global_W`.
- `torus_2d`: sqrt(n_sips)×sqrt(n_sips) wrapping mesh. Row ring on
-  `global_E/W` then col ring on `global_S/N`.
- `mesh_2d_no_wrap`: square mesh without wrap-around. Chain reduce +
+- `torus_2d`: `w × h` wrapping mesh. Row ring on `global_E/W` then col
+  ring on `global_S/N`.
+- `mesh_2d_no_wrap`: `w × h` mesh without wrap-around. Chain reduce +
  broadcast per dimension.

-2D variants require `n_sips` to be a perfect square.
+2D grid dims `(w, h)` come from `system.sips.w/h` (ADR-0024 D5). A square
+fallback (`round(sqrt(n_sips))²`) applies **only** when `w/h` are omitted,
+so rectangular grids (e.g. 6 SIPs as `3×2`) are supported by giving
+explicit `w/h`.

 ### D5. Process-group integration — `AhbmCCLBackend`

 At `init_process_group` time the backend:

 1. Loads `ccl.yaml` + `topology.yaml`.
-2. Derives `sip_topo_kind, sip_topo_w, sip_topo_h` from
-   `system.sips.topology` using the algorithm module's `TOPO_NAME_TO_KIND`.
+2. Derives `sip_topo_kind` from `system.sips.topology` via the algorithm
+   module's `TOPO_NAME_TO_KIND`, and `sip_topo_w, sip_topo_h` from
+   `system.sips.w/h` with a square-only fallback (ADR-0024 D5).
 3. Calls `configure_sfr_intercube_multisip(engine, spec, cfg)` — one-time
   SFR wiring, mirrors NCCL communicator creation.

@@ -154,17 +175,19 @@ At each `dist.all_reduce(tensor)` call:

 ```yaml
 defaults:
-  algorithm: intercube_allreduce
+  algorithm: lrab_hierarchical_allreduce
  buffer_kind: tcm
  ...

 algorithms:
-  intercube_allreduce:
-    module: kernbench.ccl.algorithms.intercube_allreduce
+  lrab_hierarchical_allreduce:
+    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    topology: none
    buffer_kind: tcm
    n_elem: 8
-    root_cube: 15
+    root_cube: 15   # NOT read today — the kernel elects the root dynamically
+                    # as the geometric center (see D1). Kept as a placeholder
+                    # for a future explicit-root override / runtime election.
 ```

 `topology.yaml`:
@@ -203,13 +226,16 @@ Modules loaded via `cfg["module"]` must export:

 - **Per-PE allreduce** (intra-cube PE-to-PE reduce). Out of scope — the
  workload for this algorithm is per-cube DP.
- **Asymmetric SIP topologies** (non-square mesh/torus). `torus_2d` and
-  `mesh_2d_no_wrap` require `n_sips = k²`.
+- **Square-grid fallback requires `n_sips = k²`**: rectangular SIP grids
+  (non-square mesh/torus) are supported, but only when `system.sips.w/h`
+  are given explicitly (ADR-0024 D5). With `w/h` omitted, 2D topologies
+  fall back to a square grid and still require `n_sips = k²`.
 - **Pipelined chunks**: single-tile per cube, no pipelining yet.
 - **Root cube runtime election**: the kernel currently uses
-  `root_cube = (mesh_h - 1) * mesh_w + (mesh_w - 1)` hardcoded to the SE
-  corner. SFR wiring covers all cubes, so runtime election is a pure kernel
-  change when needed.
+  `root_cube = (mesh_h // 2) * mesh_w + (mesh_w // 2)` — the geometric
+  center, chosen to minimize the intra-SIP critical path. SFR wiring
+  covers all cubes, so electing a different root at runtime is a pure
+  kernel change when needed.

 ---

@@ -242,15 +268,14 @@ Modules loaded via `cfg["module"]` must export:

 | File | Change |
 |---|---|
-| `src/kernbench/ccl/algorithms/intercube_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
+| `src/kernbench/ccl/algorithms/lrab_hierarchical_allreduce.py` (new) | Kernel + `_inter_sip_*` helpers + `TOPO_NAME_TO_KIND` |
 | `src/kernbench/ccl/sfr_config.py` (new) | `configure_sfr_intercube_multisip` |
 | `src/kernbench/ccl/topologies.py` | Added `torus_2d`, `mesh_2d_no_wrap` |
 | `src/kernbench/ccl/install.py` | Extended `_OPPOSITE_DIR` with `global_*` pairs |
 | `src/kernbench/runtime_api/distributed.py` | `AhbmCCLBackend` uses `configure_sfr_intercube_multisip` + appends sip_rank/topo args |
-| `ccl.yaml` | Single `intercube_allreduce` entry |
+| `ccl.yaml` | Single `lrab_hierarchical_allreduce` entry |
 | `topology.yaml` | Added `system.sips.topology` |
 | `benches/ccl_allreduce.py` | Row-wise cube-mesh tensor layout |
-| `tests/test_allreduce_multidevice.py` (new) | Config-driven ring/torus/mesh |
-| `tests/test_distributed_intercube_allreduce.py` (new) | Full `dist.all_reduce` path |
-| `tests/test_intercube_sfr_config.py` (new) | SFR wiring verification |
+| `tests/sccl/` (test package) | Config-driven ring/torus/mesh correctness + full `dist.all_reduce` path + latency/buffer-kind sweeps (evaluation harness — ADR-0043) |
+| `tests/test_intercube_sfr_config.py` | SFR wiring verification |
 | Removed | `ring_allreduce.py`, `mesh_allreduce.py`, `tree_allreduce.py`, `hierarchical_allreduce.py`, `hello_send.py`, `testing.py` and their tests |
@@ -0,0 +1,139 @@
+# ADR-0038: PCIE_EP Component Model
+
+## Status
+
+Accepted (2026-05-20).
+
+Companion to ADR-0035 (M_CPU), ADR-0036 (IO_CPU), and
+ADR-0037 (Forwarding) at the same component-model level.
+
+## First action
+
+Pull one Transaction from `_inbox` and let `_forward_txn` invoke `run()`, which
+applies a single `env.timeout(node.attrs["overhead_ns"])` for PCIe protocol
+handling. After that the standard `ComponentBase` worker rules take over: if
+`next_hop` exists, put the advanced Transaction on `out_ports[next_hop]`;
+otherwise consume `drain_ns` and call `txn.done.succeed()`.
+
+In other words, **PCIE_EP's first (and only) act is to spend the configured
+overhead as simulator time** — no routing decisions, no payload transformation,
+no MMIO decoding.
+
+## Context
+
+PCIE_EP is the **host ↔ device boundary** in the topology graph. The builder
+(`topology/builder.py`) creates an IO chiplet instance per SIP that contains
+`pcie_ep`, `io_cpu`, and `io_noc`, and lays bidirectional edges between the
+external `fabric.switch0` and each `pcie_ep`:
+
+- `switch → pcie_ep`: host → device traffic (MemoryWrite, MemoryRead,
+  KernelLaunch).
+- `pcie_ep → switch`: device-side outbound (e.g., cross-SIP IPCQ tokens).
+
+Inside the IO chiplet there are bidirectional `pcie_ep ↔ io_noc` edges, and
+from there traffic branches to `io_cpu` or to the cube-side `hbm_ctrl` path
+(see ADR-0036 IO_CPU model). The router and resolver already know — per SPEC
+R7 — that PCIE_EP is the endpoint for memory operations, so helpers like
+`find_pcie_ep(sip)` and `find_memory_path(pcie_ep, dst_node)` treat PCIE_EP as
+the start (or end) of the memory path.
+
+The problem is that all of this dependency lives in builder/router/resolver,
+while **PCIE_EP's own internal model has no ADR**. The consequence:
+
+- "What latency does PCIE_EP model?" requires reading the source.
+- The asymmetry with peer components (IO_CPU = ADR-0036, M_CPU = ADR-0035) is
+  awkward.
+- Future decisions about a more detailed PCIe link-layer model (TLP credits,
+  retry, MPS chunking) lack a documented baseline.
+
+This ADR pins down the current **thin PCIE_EP model** and records that this
+thinness is intentional (aligned with ADR-0033's latency-model simplification
+policy).
+
+## Decision
+
+### D1. PCIE_EP uses ComponentBase's generic forwarding worker as-is
+
+`PcieEpComponent` extends `ComponentBase` and does **not** override `_worker` or
+`_forward_txn`. Every Transaction flows through the standard sequence:
+
+1. `_fan_in` accumulates inbound messages (and reassembles Flits, per ADR-0033
+   Phase 2c) into `_inbox`.
+2. `_worker` pulls one message off `_inbox` and spawns
+   `env.process(self._forward_txn(env, txn))` for per-message pipelining.
+3. `_forward_txn` calls the op_log start hook → `run()` for latency → op_log
+   end hook.
+4. `run()` is a single line: `yield env.timeout(overhead_ns)`.
+5. If a next hop exists, `out_ports[next_hop].put(txn.advance())`. Otherwise
+   (terminal arrival) consume `drain_ns` and call `txn.done.succeed()`.
+
+### D2. The only timing parameter is `overhead_ns`
+
+Only `node.attrs["overhead_ns"]` is accepted as a latency parameter. The code
+default is `0.0`; `topology.yaml`'s IOChiplet `components.pcie_ep.attrs`
+supplies the real value (current topology: `overhead_ns: 5.0` ns).
+
+No separate BW-serialization resource (`simpy.Resource`), no queue depth, no
+retry model is introduced. Link-level BW serialization is handled wire-side —
+inside the IOChiplet by `pcie_ep_to_noc_bw_gbs = 256.0 GB/s`, and externally by
+the system's `io_ep_to_switch` link BW (ADR-0015 port/wire model). PCIE_EP
+itself takes no part in that accounting.
+
+### D3. PCIE_EP is direction-aware in topology but direction-blind in code
+
+The builder lays both `switch ↔ pcie_ep` and `pcie_ep ↔ io_noc` edges, so
+PCIE_EP serves:
+
+- inbound (host → device): forward Transactions arriving from the switch onto
+  the io_noc-side next hop.
+- outbound (device → host): forward Transactions arriving from io_noc/io_cpu
+  back to the switch.
+
+Both are handled by D1's generic forwarding worker; the component code never
+distinguishes direction (it just follows `txn.next_hop`).
+
+### D4. PCIE_EP is not Flit-aware (legacy reassembly path)
+
+`_FLIT_AWARE` is left at the inherited `False`, so `_fan_in` reassembles
+upstream-chunkified Flits into the parent Transaction before delivery to
+`_inbox` (aligned with ADR-0033 Phase 2c incremental rollout).
+
+A future PCIe TLP-level credit model would revisit D4.
+
+### D5. PCIE_EP is a **named node** for routing helpers
+
+`policy/routing/router.py` provides `find_pcie_ep(sip, io_id="io0")`,
+`find_all_pcie_eps()`, and `find_memory_path(pcie_ep, dst_node)` — all of
+which treat PCIE_EP as the start (or end) of the memory path. The component
+itself supplies no information to these helpers; the naming convention
+(`sip{S}.{io_id}.pcie_ep`) is guaranteed by the topology builder.
+
+## Alternatives Considered
+
+### A1. Full PCIe TLP-level model (credits, retry, MPS chunking)
+
+Rejected. Violates ADR-0033's "current latency model = abstract overhead + BW
+serialization" simplification. Host↔device protocol fidelity is explicitly
+out-of-scope in SPEC §5 "Non-Goals".
+
+### A2. Per-PCIE_EP `simpy.Resource` for in-flight cap
+
+Rejected. Host traffic is not a contention bottleneck in current workloads.
+Defer to a separate ADR if it becomes one (in which case D1 stays and D2 is
+extended).
+
+### A3. Merge PCIE_EP into IO_CPU
+
+Rejected. PCIE_EP is the protocol-boundary node first hit on the host side;
+IO_CPU is the device-side control-plane processing node (ADR-0036). Traffic
+fan-out and command decoding costs concentrate in IO_CPU, while PCIE_EP only
+expresses link-edge overhead. Merging them would mix two responsibilities and
+violate the spirit of ADR-0007 (runtime API/sim_engine boundaries).
+
+## Consequences
+
+- PCIE_EP gets an explicit model ADR despite having near-zero code, which keeps
+  it consistent with peer component ADRs and lowers maintenance friction.
+- Any future PCIe-level refinement supersedes this ADR by extending D2/D4 in a
+  new ADR.
+- D5 makes the named-node dependency explicit, so any future renaming of
+  component IDs has a clearly bounded blast radius.
@@ -0,0 +1,203 @@
+# ADR-0039: PE_MMU Component Model — Component + Utility Dual Role
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0011 (PA/VA/LA address model) only states that "the VA model translates
+VA→PA via PE_MMU"; this ADR pins down **the PE_MMU component's own behavior
+model**.
+
+## First action
+
+At construction, read `node.attrs["page_size"]` (default `2 MiB`) and
+`node.attrs["tlb_overhead_ns"]` (default `0.0`) and instantiate the internal
+`PeMMU` utility object (`policy.address.pe_mmu.PeMMU`) exactly once. That
+object is the single owner of the page table, the sub-page region lists, and
+the TLB overhead value.
+
+At runtime the first action splits into two paths:
+
+- **Component path (inbox consumption)**: `_worker` pulls a Transaction off
+  `_inbox`; if `request` is a `MmuMapMsg`, call `self._mmu.map(va, pa, size)`
+  for each entry and then `txn.done.succeed()`. For `MmuUnmapMsg`, call
+  `unmap(va, size)`. Any other type falls through to standard `_forward_txn`.
+  In other words, **the component's first act is "apply map/unmap commands to
+  the page table"**.
+- **Utility path (direct call)**: a sibling PE engine (PE_DMA / PE_GEMM) calls
+  `pe_mmu.mmu.translate(va)` directly. This path produces no SimPy events;
+  the caller (when `overhead_ns > 0`) issues a `yield env.timeout(mmu.overhead_ns)`
+  in its own process.
+
+## Context
+
+ADR-0011 defined three address models (PA/VA/LA) and agreed that "VA model =
+translation via PE_MMU". But in code, `PeMmuComponent` performs two
+complementary roles simultaneously:
+
+1. **A topology-graph component**: it receives `MmuMapMsg` / `MmuUnmapMsg`
+   sideband messages over the cube NoC and updates the page table.
+2. **A PE-local utility**: PE_DMA / PE_GEMM on the same PE call
+   `translate(va)` directly with zero SimPy latency (the caller pays
+   `overhead_ns` if any).
+
+Without an ADR covering both roles, the following questions are ambiguous:
+
+- "Why isn't there a SimPy event for the MMU translate?" (Answer: the caller
+  pays it.)
+- What is the sub-page region model, and why? (The code docstring has it, but
+  no ADR — only a memory note `project_mmu_subpage_stopgap`.)
+- Who sends map/unmap, and when must they be visible? (Ordering contract.)
+
+Additionally, `PeMMU.map()` has "append, last-write-wins on overlap"
+semantics, which a one-PA-per-entry page table cannot express. This is a
+deliberate **simulator stopgap**: it lets DPPolicy shard below page
+granularity (e.g., 128 B payloads against 4 KiB pages) without a later map
+silently replacing the whole page's mapping and misrouting the other shards.
+This deviation from real HW MMU semantics must be ADR-pinned.
+
+## Decision
+
+### D1. Explicit dual role — component and utility
+
+`PeMmuComponent` exposes two interfaces from a single class:
+
+- Component interface: `_inbox` consumption, `_worker` loop (handles MMU
+  sideband messages).
+- Utility interface: the `mmu` property exposes the underlying `PeMMU` object,
+  which PE_DMA / PE_GEMM hold directly and invoke `translate()` on.
+
+The latter is **not a layer skip**: inside a PE, the engines and PE_MMU are
+siblings under the "components" layer (ADR-0007). Cross-layer violations only
+apply to runtime API ↔ sim_engine ↔ components boundaries.
+
+### D2. Latency model — `translate()` is pure; caller owns the timeout
+
+`PeMMU.translate()` is a pure function and yields nothing in SimPy. The caller
+(a PE engine) issues `if mmu.overhead_ns > 0: yield env.timeout(mmu.overhead_ns)`
+in its own process after translation.
+
+Rationale: the PE engine process already holds its own `record_start` /
+`record_end` (op_log) hooks, so keeping timing inside the caller's process
+preserves consistent timing accounting. A separate MMU process would split the
+engine's processing flow and blur op_log / pipeline overlap semantics.
+
+#### D2.1. Current implementation asymmetry — pipeline vs non-pipeline (known)
+
+At the time of writing, `pe_dma.py` handles MMU overhead differently in its
+two call paths:
+
+- **non-pipeline (`handle_command`)**: after `translate()`, applies
+  `if self._mmu.overhead_ns > 0: yield env.timeout(self._mmu.overhead_ns)`.
+- **pipeline (`_do_pipeline_dma`)**: calls `translate()` only, **omitting**
+  the overhead timeout — though the comment says "same logic as non-pipeline
+  path", the behaviors differ.
+
+In the default topology, `tlb_overhead_ns = 0.0`, so this asymmetry does not
+manifest. With `tlb_overhead_ns > 0`, however, GEMM/Math work that goes through
+the pipeline path appears faster than the equivalent non-pipeline workload by
+the omitted MMU overhead.
+
+The D2 contract states that **all** callers pay the overhead; the pipeline
+omission is **not an intentional design** — ADR-0014 D6 (pipeline self-routing)
+does not exempt it. Remediation options (require a separate Phase 1/2):
+
+- (a) Add `if mmu.overhead_ns > 0: yield env.timeout(...)` in
+  `_do_pipeline_dma` to align with D2 — **preferred**.
+- (b) Narrow the D2 contract to "non-pipeline only" and document the pipeline
+  exemption in an ADR-0014 update — discouraged, since it weakens the
+  overhead's meaning.
+
+This ADR recommends (a) and assumes a small follow-up change either before or
+just after acceptance.
+
+### D3. Page table structure — sub-page region list (stopgap)
+
+`self._table: dict[vpn, list[(start_in_page, end_in_page, pa_at_offset_zero)]]`
+holds multiple disjoint regions per page.
+
+- `map(va, pa, size)`: append regions when the range crosses a page boundary.
+- `translate(va)`: look up regions for the VPN and iterate **in reverse** so
+  the most recent overlapping region wins (last-write-wins).
+- `unmap(va, size)`: remove only regions whose extent is **fully contained**
+  within the unmap range; partial-overlap boundaries are left in place and the
+  caller is expected to unmap on the same boundaries used for map.
+
+This is documented as a **simulator stopgap** that supplements the VA model
+from ADR-0011. It prevents silent last-write-wins misrouting when DPPolicy
+shards below page granularity. Memory note: `project_mmu_subpage_stopgap`.
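
The region-list behavior described in D3 can be sketched as below. This is a hypothetical `SubPageTable`, not the actual `PeMMU` class. It assumes half-open `[start, end)` region bounds and that `map()` splits a range into one region per page it touches.

```python
from typing import Dict, List, Tuple

# (start_in_page, end_in_page, pa_at_offset_zero); end is exclusive by assumption.
Region = Tuple[int, int, int]


class PageFault(Exception):
    """Raised when no region covers the requested VA."""


class SubPageTable:
    def __init__(self, page_size: int = 2 * 1024 * 1024) -> None:
        self.page_size = page_size
        self._table: Dict[int, List[Region]] = {}

    def map(self, va: int, pa: int, size: int) -> None:
        end = va + size
        while va < end:
            vpn, start = divmod(va, self.page_size)
            chunk = min(end - va, self.page_size - start)
            # Store the PA that page offset 0 would have, so translate() is one add.
            self._table.setdefault(vpn, []).append((start, start + chunk, pa - start))
            va += chunk
            pa += chunk

    def translate(self, va: int) -> int:
        vpn, offset = divmod(va, self.page_size)
        # Reverse iteration: the most recently appended overlapping region wins.
        for start, stop, pa_at_zero in reversed(self._table.get(vpn, [])):
            if start <= offset < stop:
                return pa_at_zero + offset
        raise PageFault(hex(va))

    def unmap(self, va: int, size: int) -> None:
        if size <= 0:
            return
        end = va + size
        for vpn in range(va // self.page_size, (end - 1) // self.page_size + 1):
            base = vpn * self.page_size
            # Drop only regions fully contained in [va, end); partial overlaps stay.
            kept = [r for r in self._table.get(vpn, [])
                    if not (va <= base + r[0] and base + r[1] <= end)]
            if kept:
                self._table[vpn] = kept
            else:
                self._table.pop(vpn, None)
```

With a 4 KiB page, two 128 B shards mapped to different PAs in the same page each translate to their own target, which a single PA per page could not represent.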
+
+### D4. PageFault signals PA fallback
+
+If `translate()` is called with an unmapped VA, `PageFault` is raised. PE_DMA
+catches the exception and **uses the original address as a PA** (the PA-only
+backward-compatibility path from ADR-0011). PageFault is therefore not an
+error — it is the signal for "no VA mapping, interpret as PA".
+
+This path is intentional and preserves backward compatibility with the
+ADR-0011 PA-only mode.
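
The fallback in D4 amounts to a try/except around the translate call. A minimal sketch follows; `resolve_address` and the dict-backed `translate` are hypothetical stand-ins for the PE_DMA call site and `PeMMU.translate()`.

```python
from typing import Callable, Dict


class PageFault(Exception):
    """Signal for 'no VA mapping'; treated as control flow, not as an error."""


def resolve_address(translate: Callable[[int], int], addr: int) -> int:
    """Return the translated PA, or the address itself when it has no mapping."""
    try:
        return translate(addr)
    except PageFault:
        # PA-only backward-compatibility path: interpret the address as a PA.
        return addr


def make_translate(mappings: Dict[int, int]) -> Callable[[int], int]:
    """Build a toy exact-match translate function for the demo."""
    def translate(va: int) -> int:
        if va not in mappings:
            raise PageFault(hex(va))
        return mappings[va]
    return translate
```

A caller written this way never needs to know in advance whether an address is a VA or a PA, which is the compatibility property D4 preserves.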
+
+### D5. MMU sideband-message reception contract
+
+`MmuMapMsg` / `MmuUnmapMsg` arrive over the fabric at PE_MMU's `_inbox`
+(SPEC R10: "MMU map installation incurs measured fabric latency"). Schemas
+live in `runtime_api/kernel.py`:
+
+- `MmuMapMsg.entries: tuple[dict, ...]` — each dict is
+  `{"va": int, "pa": int, "size": int}`.
+- `MmuUnmapMsg.entries: tuple[dict, ...]` — each dict is
+  `{"va": int, "size": int}`.
+
+PE_MMU reception flow:
+
+1. `_worker` does `_inbox.get()` for one message.
+2. `hasattr(msg, "request")` confirms a Transaction wrapper.
+3. `isinstance(msg.request, MmuMapMsg)` → for each entry, call
+   `self._mmu.map(va=e["va"], pa=e["pa"], size=e["size"])`.
+4. `isinstance(msg.request, MmuUnmapMsg)` → for each entry, call
+   `self._mmu.unmap(va=e["va"], size=e["size"])`.
+5. Both signal `msg.done.succeed()` after completion.
+
+An external caller (runtime API) waiting on `done` therefore receives a SimPy
+guarantee that "the mapping is installed on-device" — this is the realization
+of ADR-0011's "MMU map installation incurs measured fabric latency".
+
+This ADR does **not** define the **sender or fan-out policy** for the sideband
+message — those are runtime API responsibilities. Only the receive contract
+belongs here.
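
The reception flow above, together with the D6 fall-through, can be sketched without SimPy. The `Txn`, `DoneFlag`, and `RecordingMmu` types are stand-ins invented for this example. Only the two message names and their `entries` field shapes come from the text.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class MmuMapMsg:
    entries: Tuple[Dict[str, int], ...]  # each: {"va", "pa", "size"}


@dataclass(frozen=True)
class MmuUnmapMsg:
    entries: Tuple[Dict[str, int], ...]  # each: {"va", "size"}


class DoneFlag:
    """Stand-in for a SimPy event: only records that succeed() was called."""
    def __init__(self) -> None:
        self.triggered = False

    def succeed(self) -> None:
        self.triggered = True


@dataclass
class Txn:
    request: Any
    done: DoneFlag


class RecordingMmu:
    """Records calls instead of maintaining a page table."""
    def __init__(self) -> None:
        self.calls: List[tuple] = []

    def map(self, va: int, pa: int, size: int) -> None:
        self.calls.append(("map", va, pa, size))

    def unmap(self, va: int, size: int) -> None:
        self.calls.append(("unmap", va, size))


def handle_message(mmu: RecordingMmu, msg: Any, forward: Callable[[Any], None]) -> bool:
    """Return True when the message was consumed as an MMU sideband command."""
    request = getattr(msg, "request", None)
    if isinstance(request, MmuMapMsg):
        for e in request.entries:
            mmu.map(va=e["va"], pa=e["pa"], size=e["size"])
    elif isinstance(request, MmuUnmapMsg):
        for e in request.entries:
            mmu.unmap(va=e["va"], size=e["size"])
    else:
        forward(msg)  # anything else takes the generic forwarding path
        return False
    msg.done.succeed()  # signalled only after every entry has been applied
    return True
```

Signalling `done` after the loop is what lets the sender treat completion as "all entries installed" rather than "message received".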
+
+### D6. Non-MMU Transactions delegate to generic forwarding
+
+If a message pulled from `_inbox` is not `MmuMapMsg` / `MmuUnmapMsg` (or
+lacks a `request` attribute), `_forward_txn` handles it normally. This keeps
+the door open for future topologies where PE_MMU sits on a pass-through path —
+current code never sends such traffic, but the routing remains safe.
+
+## Alternatives Considered
+
+### A1. Make `translate()` a SimPy generator
+
+Rejected. As D2 explains, this blurs op_log / pipeline overlap accounting in
+the PE engine.
+
+### A2. Use small page size (e.g., 128 B) instead of sub-page regions
+
+Rejected. Would explode page-table memory and cube-wide map message size. Most
+mappings are 2 MiB; pushing the page size below that for the few DPPolicy
+sharding cases inflates average cost.
+
+### A3. Make PE_MMU a PE_CPU helper only (not a topology node)
+
+Rejected. ADR-0011 requires that MMU map installation incur measured fabric
+latency (via `MmuMapMsg`), which requires PE_MMU to be a node on the graph.
+It also keeps cube NoC visualizer output consistent.
+
+## Consequences
+
+- PE_MMU's dual role is justified at ADR level, so future "unify into one"
+  refactor pressure has a documented counterpoint.
+- The sub-page region model is explicitly labeled a stopgap, providing a
+  basis for deprecating it when the LA model (ADR-0011) lands.
+- The "`translate()` does not yield" contract is locked in (D2), so any
+  future proposal to add an internal MMU timeout can be denied with a
+  documented rationale.
+- PA fallback (D4) is normalized, preventing defensive logic from treating
+  PageFault as an error.
@@ -0,0 +1,149 @@
+# ADR-0040: PE_TCM Component Model — Dual-Channel BW Serialization
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0014 (PE Pipeline Execution Model, D1) references PE_TCM as a "BW-based
+serialized scratchpad memory" but does not pin down the component's own model.
+This ADR fills that gap.
+
+## First action
+
+When `start()` is invoked, immediately create two `simpy.Resource(env, capacity=1)`
+instances and store them in `self._read_res` / `self._write_res`. These two
+resources are the single decision points that serialize the **read channel**
+and **write channel** to one in-flight request each.
+
+The runtime first action: `_worker` pulls a message off `_inbox` and branches
+by type:
+
+- `TcmRequest` (from `pe_fetch_store`): spawn `env.process(self._handle_tcm_request)`.
+  Hence **TCM's first act is "acquire the lock matching the direction
+  (read/write)"**. After lock acquisition, if `bw > 0 and nbytes > 0`, yield
+  `env.timeout(delay_ns = nbytes / bw)`, then `req.done.succeed()`.
+- Anything else (Transaction): spawn `env.process(self._forward_txn)` (legacy
+  fabric pass-through).
+
+At construction, `node.attrs["read_bw_gbs"]` and `node.attrs["write_bw_gbs"]`
+(default `512.0 GB/s` each) are captured and held.
+
+## Context
+
+In the PE pipeline (ADR-0014 D1, D6), PE_TCM receives two kinds of traffic:
+
+1. **`TcmRequest` from PE_FETCH_STORE** — when moving data between TCM and
+   the register file, PE_FETCH_STORE sends a short sideband request to obtain
+   BW-serialized access latency (`direction = "read"` or `"write"`, `nbytes`,
+   `done` event).
+2. **Legacy Transaction forwarding** — a fallback in case TCM ends up as a
+   pass-through node on the fabric graph (not used by the current critical
+   path, but preserved).
+
+The problem: ADR-0014 only says "BW-based serialization" without specifying:
+
+- Read and write are **independent channels** running in parallel; only
+  same-direction concurrency serializes at `capacity=1`.
+- BW is split into two configurable values (`read_bw_gbs` / `write_bw_gbs`).
+- The formula is `delay_ns = nbytes / bw_gbs` (loose unit convention:
+  GB/s × ns ≈ B).
+- `nbytes == 0` still acquires the lock but skips the BW term.
+- `run()`'s `overhead_ns` (default `0.0`) is only used in the legacy fabric
+  forwarding path.
+
+Each of these requires an ADR. In particular, "why are read and write
+separate channels" and "who owns the BW values" must be documented so that
+future changes (e.g., `capacity=2`) have a clear basis.
+
+## Decision
+
+### D1. Dual channel — read and write are independent resources
+
+`_read_res = simpy.Resource(env, capacity=1)`,
+`_write_res = simpy.Resource(env, capacity=1)`.
+Same-direction concurrent requests queue on the resource and serialize;
+opposite-direction requests proceed in parallel. This matches the hardware
+model where TCM has a dual-port (read + write) configuration, and it allows
+the simulator to express the GEMM-pipeline case where fetch (read) and store
+(write) overlap in time — modeled as BW-serialized inside each direction but
+independent across directions.
+
+### D2. Per-channel BW model — `nbytes / bw_gbs`
+
+After lock acquisition, if `nbytes > 0 and bw > 0`, yield
+`env.timeout(nbytes / bw_gbs)`. The unit convention is GB/s × ns ≈ B,
+consistent with the simulator-wide loose convention (see ADR-0033).
+
+- `nbytes == 0`: BW term is zero, but the lock is acquired and released. This
+  is intentional: when a plan generator emits an empty fetch/store on the
+  PE_FETCH_STORE side, the op_log / channel accounting on the TCM side still
+  records one consumption.
+- `bw == 0` (config error): the timeout call is skipped (0-time pass). Should
+  not occur with normal settings.
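
The per-channel timing in D1 and D2 can be worked through with plain arithmetic, without SimPy. This sketch assumes requests are served first-come, first-served within each direction and are listed in arrival order; `tcm_delay_ns` and `completion_times` are illustrative names, not functions from `pe_tcm.py`.

```python
from typing import Dict, Iterable, List, Tuple


def tcm_delay_ns(nbytes: int, bw_gbs: float) -> float:
    """BW term of D2: 1 GB/s moves 1 byte per ns, so bytes / (GB/s) is in ns."""
    if nbytes > 0 and bw_gbs > 0:
        return nbytes / bw_gbs
    return 0.0  # empty transfer or bw == 0: no time is charged


def completion_times(
    requests: Iterable[Tuple[float, str, int]],
    read_bw_gbs: float = 512.0,
    write_bw_gbs: float = 512.0,
) -> List[float]:
    """requests: (arrival_ns, direction, nbytes), listed in arrival order."""
    bw: Dict[str, float] = {"read": read_bw_gbs, "write": write_bw_gbs}
    free_at: Dict[str, float] = {"read": 0.0, "write": 0.0}
    finished: List[float] = []
    for arrival_ns, direction, nbytes in requests:
        # Same-direction requests queue behind each other; the other channel is independent.
        start = max(arrival_ns, free_at[direction])
        finish = start + tcm_delay_ns(nbytes, bw[direction])
        free_at[direction] = finish
        finished.append(finish)
    return finished


if __name__ == "__main__":
    reqs = [(0.0, "read", 4096), (0.0, "read", 4096), (0.0, "write", 4096), (0.0, "read", 0)]
    print(completion_times(reqs))
```

In the demo, the second read waits for the first while the write overlaps both, and the zero-byte read still queues behind the earlier reads even though it adds no transfer time.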
+
+### D3. BW values come from `node.attrs.read_bw_gbs` / `write_bw_gbs`
+
+Both default to `512.0 GB/s`. The topology builder (`topology/builder.py`)
+passes these attrs when instantiating TCM from `pe_template`. Changes to the
+defaults should be coordinated with the related decisions in ADR-0014 D1 or
+ADR-0033.
+
+### D4. TcmRequest schema is owned by PE_TCM
+
+`@dataclass TcmRequest(direction: str, nbytes: int, done: simpy.Event, tag: str = "")`
+lives in `components/builtin/pe_tcm.py`. PE_FETCH_STORE imports the dataclass
+and only constructs/sends it. The caller does not define the schema because:
+
+- The meaning of BW serialization is TCM's responsibility — TCM decides which
+  fields drive serialization.
+- The valid-value check for `direction` (must be `"read"` or `"write"`) lives
+  in `_handle_tcm_request`'s if/else branch.
+
+### D5. Legacy Transaction forwarding path is preserved
+
+When `_worker` receives a non-`TcmRequest` message, it dispatches to
+`_forward_txn`, applying `run()`'s `overhead_ns`. The current standard PE
+pipeline does not route Transactions through TCM, but the path is kept to
+avoid breakage if fabric topology changes.
+
+This path is accounted for via standard Transaction op_log; the BW channel
+locks are **not** acquired (orthogonal to D1's usage).
+
+### D6. PE_TCM is not a data store (timing only)
+
+TCM models **time only**. The actual data payload is held by sim_engine's
+`memory_store` (when present); the TCM component never updates it.
+PE_FETCH_STORE obtains BW delay through `TcmRequest`, and register contents
+are handled separately in the data path (ADR-0020 2-pass data execution —
+Phase 2).
+
+## Alternatives Considered
+
+### A1. Single channel (`capacity=1` for shared read+write)
+
+Rejected. Would artificially serialize the normal-case overlap of fetch
+(read) and store (write) and yield an incorrect BW upper bound for the PE
+pipeline.
+
+### A2. `capacity > 1` (e.g., 2-banked TCM)
+
+Rejected. Current hardware model assumes a single bank. Multi-bank extension
+needs its own ADR that would supersede D1. Bumping capacity now would loosen
+the nominal serialization without raising the BW upper bound, producing less
+accurate modeling.
+
+### A3. Generalize BW formula to `nbytes / bw + overhead_ns`
+
+Rejected. `overhead_ns` is reserved for the legacy forwarding path (D5).
+Additional fetch/store-path overhead, if needed, belongs in PE_FETCH_STORE's
+`run()` or in a register-file access model — closer to the responsibility
+boundary.
+
+## Consequences
+
+- TCM's BW accounting is locked at ADR level. Questions arising from op_log
+  in GEMM/Math sweeps — "why did fetch and store overlap?", "why do only
+  same-direction requests serialize?" — resolve quickly to D1.
+- Future multi-bank TCM models or asymmetric read/write BW changes have a
+  clear blast radius (D1 / D2 / D3 — pick one).
+- D6 ("TCM is not a data store") sharpens the responsibility boundary with
+  ADR-0020 2-pass execution.
@@ -0,0 +1,195 @@
+# ADR-0041: Cube SRAM Component Model — terminal scratchpad on cube NoC
+
+## Status
+
+Accepted (2026-05-20).
+
+ADR-0017 (Cube NOC and HBM Connectivity) describes SRAM as a cube-NoC
+attachment but does not specify the SRAM component's own latency / response
+model. This ADR fills that gap.
+
+## First action
+
+Inside `_worker`, immediately after pulling a Transaction off `_inbox`, the
+very first action is `yield from self.run(env, txn.nbytes)`. Inside `run()`,
+the component applies `env.timeout(node.attrs["overhead_ns"])`
+(default `0.0`).
+
+In short, **SRAM's first act is "express access overhead as simulator time"**.
+After overhead, the worker yields `drain_ns` (the terminal BW-serialization
+cost stamped on the Transaction) and then constructs and dispatches a
+`ResponseMsg` on the reverse path.
+
+This differs from a generic `ComponentBase._worker`: SRAM knows it is a
+**terminal node**, so it does not go through `_forward_txn`. Its own worker
+explicitly performs `run → drain → _send_response`.
+
+## Context
+
+The cube topology (`topology/builder.py`) creates the following named nodes
+per cube:
+
+- `sip{S}.cube{C}.m_cpu`
+- `sip{S}.cube{C}.sram`
+- `sip{S}.cube{C}.hbm_ctrl` (per-PE partitions)
+- `sip{S}.cube{C}.pe{P}` (and its PE-internal sub-components)
+
+SRAM is one of the cube-NoC attachments — `topology/mesh_gen.py` assigns it
+to the nearest router by placement coordinates and adds `"sram"` to that
+router's `attach` list. The builder lays bidirectional `sram ↔ router` edges
+(BW: `sram_to_router_bw_gbs`, default `128.0 GB/s`).
+
+SRAM has two intertwined roles:
+
+1. **Fabric terminal**: the endpoint for cube-NoC memory-access Transactions
+   destined for SRAM. SRAM consumes access overhead + drain, then sends a
+   response back on the reverse path.
+2. **One of the IPCQ slot tiers**: ADR-0023 D9.7 defines
+   `buffer_kind ∈ {tcm, sram, hbm}`; the `sram` tier's per-access cost is
+   `(512.0 GB/s, 2.0 ns)` in `common/ipcq_types._BUFFER_KIND_BW`. This is
+   separate from the SRAM node's `overhead_ns` attr; PE_DMA accounts for it
+   directly at the IPCQ slot-write moment.
+
+Without an ADR covering both roles, the following questions are ambiguous:
+
+- "What latency does SRAM model?" Fabric drain + overhead, or the IPCQ
+  tier slot latency? Today the answer is scattered.
+- What will the `size_mb` (`32`) attr mean in the future? It is currently
+  unused; SRAM only models timing.
+- Which cube router does SRAM attach to? (placement-based; lives in topology
+  code only.)
+
+## Decision
+
+### D1. SRAM is a terminal scratchpad node on the cube NoC
+
+`SramComponent` extends `ComponentBase` but overrides `_worker` to express
+terminal semantics directly:
+
+```
+while True:
+    txn = yield self._inbox.get()
+    yield from self.run(env, txn.nbytes)     # overhead_ns
+    # drain_ns: wire-side serialization cost stamped on txn (see D3)
+    if drain_ns > 0:
+        yield env.timeout(drain_ns)
+    yield from self._send_response(env, txn)
+```
+
+This pattern is necessary because SRAM must know the reverse path; the
+generic `_forward_txn` (which forwards to the next hop) does not fit a
+terminal.
+
+#### D1.1. Currently dormant — the `_worker` override is an unused path
+
+At the time of writing, **no component actually sends a Transaction to the
+SRAM node**. The verified references to the SRAM node ID are:
+
+- `policy/routing/router.py` and friends — guarantee path lookups.
+- `components/builtin/pe_dma.py::_handle_ipcq_inbound` — for
+  `buffer_kind == "sram"`, computes the *path* to
+  `bank_node = f"{cube_prefix}.sram"` via `compute_drain_ns(path, ...)` and
+  yields a **local** timeout. The Transaction itself does not flow to the
+  SRAM node (see D4).
+- `tests/test_routing.py` — checks connectivity via
+  `find_path("sip0.cube0.pe0", "sip0.cube0.sram")`.
+
+So the `_worker` / `_send_response` override is currently a **dormant code
+path**. It is preserved deliberately:
+
+- Topology changes that route fabric Transactions to SRAM terminally (e.g.,
+  explicit M_CPU → SRAM accesses) would activate it immediately.
+- ADR-0017's "cube-attached scratchpad" semantics naturally implies terminal
+  behavior; the override is an intentional placeholder.
+
+A future ADR (or a revision to this one) will mark dormancy resolved when an
+actual sender is added.
+
+### D2. ResponseMsg construction and reverse-path dispatch
+
+`_send_response`:
+
+1. `reverse_path = list(reversed(txn.path))` — derive the reverse path.
+2. Construct `ResponseMsg(correlation_id=txn.request.correlation_id,
+   request_id=..., src_cube=<this cube>, src_pe=-1, success=True)`.
+3. Wrap in `Transaction(request=resp_msg, path=reverse_path, step=0,
+   nbytes=0, done=env.event(), is_response=True)` and put on
+   `out_ports[reverse_path[1]]`.
+4. If the reverse path is too short (`< 2 hops`) or `ctx` is absent, fall
+   back to calling the original `txn.done.succeed()`.
+
+`src_pe = -1` means "SRAM is not PE-localized". `src_cube` is parsed from the
+node ID (`sip{S}.cube{C}.sram`).
+
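The path handling in D2 can be sketched as two small helpers. The names `parse_src_cube` and `response_route` are hypothetical, as is the router node ID in the usage note; only the `sip{S}.cube{C}.sram` ID shape and the "fewer than 2 hops means fall back" rule come from the text above.

```python
import re


def parse_src_cube(node_id):
    """Extract C from an SRAM node ID of the form sip{S}.cube{C}.sram."""
    match = re.fullmatch(r"sip(\d+)\.cube(\d+)\.sram", node_id)
    if match is None:
        raise ValueError(f"not an SRAM node id: {node_id!r}")
    return int(match.group(2))


def response_route(path):
    """Return (reverse_path, first_hop), or None when the fallback applies.

    The fallback case corresponds to calling txn.done.succeed() directly
    because the reversed path has fewer than 2 entries.
    """
    reverse_path = list(reversed(path))
    if len(reverse_path) < 2:
        return None
    # reverse_path[0] is the SRAM node itself; [1] is the first hop back.
    return reverse_path, reverse_path[1]
```

For a request path `["sip0.cube0.pe0", "sip0.cube0.r1", "sip0.cube0.sram"]`, the response is put on the port toward `sip0.cube0.r1`.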
+### D3. Timing parameters: `overhead_ns` and wire-side `drain_ns`
+
+- **Component-side latency**: `node.attrs["overhead_ns"]`. Default topology
+  uses `2.0 ns`.
+- **Link-side serialization**: `drain_ns` arrives stamped on the Transaction
+  — the wire-side BW serialization result from ADR-0015. SRAM only yields it.
+- The `size_mb` (default `32 MiB`) attr is currently timing-neutral. If a
+  capacity-aware model is added in the future, a separate ADR will give it
+  meaning.
+
+### D4. IPCQ slot accounting is not modeled by the SRAM component
+
+Per ADR-0023 D9.7, the IPCQ slot-write latency for the SRAM tier is incurred
+inside PE_DMA's `_handle_ipcq_inbound`, which calls
+`slot_io_latency_ns("sram", nbytes)` using `_BUFFER_KIND_BW["sram"]`. That is:
+
+- When SRAM receives a fabric Transaction (D1, D2, D3 apply), it processes
+  normally.
+- When an IPCQ slot lives on SRAM, PE_DMA pays the slot-write time directly —
+  independent of the SRAM component.
+
+This separation is intentional: IPCQ is a fast path (sub-cycle slot
+bookkeeping) and does not traverse fabric Transactions, so SRAM does not need
+to know about IPCQ.
+
+### D5. SRAM's cube-NoC attachment is placement-driven
+
+`topology/mesh_gen.py` reads `placement.sram.pos_mm` (default `[1.5, 9.0]` in
+`topology.yaml`) and adds `"sram"` to the nearest router's `attach`. The
+builder (`topology/builder.py`'s attachment loop) then lays bidirectional
+edges between the `sram` node and that router.
+
+This decision lives outside the SRAM component (mesh_gen / builder); the
+component does not know which router it sits on. It only relies on
+`txn.path` / `reverse_path` to reach it via a router.
+
+### D6. SRAM is not a data store (timing only)
+
+Same context as ADR-0040 D6: the SRAM component models time only; the data
+payload (if any) lives in sim_engine's `memory_store`.
+
+## Alternatives Considered
+
+### A1. Use `_forward_txn` and route responses via separate nodes (à la IO_CPU / HBM_CTRL)
+
+Rejected. SRAM is a terminal on the cube NoC; adding a response node would
+introduce meaningless hops and violate ADR-0017's simplification spirit.
+
+### A2. Model BW serialization inside SRAM with its own resource
+
+Rejected. Wire-side BW serialization (`drain_ns`) already captures it. An
+internal `simpy.Resource` would double-count against ADR-0015 (port/wire
+model).
+
+### A3. Handle IPCQ slot accounting in the SRAM component
+
+Rejected. As D4 makes explicit, IPCQ is a fast path that does not traverse
+fabric Transactions. If SRAM knew about IPCQ, the responsibility would split
+across two places and obscure reasoning.
+
+### A4. Capacity-aware latency from `size_mb`
+
+Rejected for now. The capacity is currently a visualizer label; introducing
+a capacity-aware timing model requires a dedicated ADR.
+
+## Consequences
+
+- SRAM's timing model is pinned at ADR level as
+  `overhead_ns + drain_ns + ResponseMsg(reverse_path)`. Any proposal to push
+  IPCQ slot latency into the SRAM component can be refused with D4.
+- D3 records that `size_mb` is timing-neutral today, so a future
+  capacity-aware model has a narrow compatibility scope.
+- D5 documents the placement-driven attachment, so changes to the SRAM
+  coordinate have a clearly bounded impact (`mesh_gen` only).
@@ -0,0 +1,199 @@
+# ADR-0042: Tile Plan Generators — GEMM/Math Pipeline Plan Builders
+
+## Status
+
+Accepted (2026-05-20).
+
+This ADR pins down `tiling.py` as a **plan-generator module**, not a SimPy
+component.
+
+ADR-0014 (PE Pipeline Execution Model) D6 (tile plan / self-routing) does not
+specify the tile-plan generation algorithm itself; this ADR fills that gap.
+
+## First action
+
+When `generate_gemm_plan(M, K, N, tile_m, tile_k, tile_n, ..., pe_prefix,
+a_pinned, b_pinned, epilogue_specs)` is called, the very first action is
+**computing tile counts and constructing the PE-component ID strings**:
+
+```
+M_tiles = max(1, ceil(M / tile_m))
+K_tiles = max(1, ceil(K / tile_k))
+N_tiles = max(1, ceil(N / tile_n))
+dma_id   = f"{pe_prefix}.pe_dma"
+fetch_id = f"{pe_prefix}.pe_fetch_store"
+gemm_id  = f"{pe_prefix}.pe_gemm"
+math_id  = f"{pe_prefix}.pe_math"
+```
+
+In short, **the plan generator's first act is "compute ceiling tile counts
+and assemble the four sub-component IDs for this PE once"**. No SimPy event
+or environment is touched — this module is a pure function.
+
+`generate_math_plan(M, N, tile_m, tile_n, ..., math_op, src_addr, dst_addr,
+pe_prefix)` likewise begins by computing `M_tiles`, `N_tiles` and assembling
+three component IDs (`dma_id`, `fetch_id`, `math_id`).
+
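As a standalone illustration of this first step, the sketch below computes the ceiling tile counts and assembles the component IDs. The helper names `tile_counts` and `component_ids` are hypothetical; the real logic lives inline in `generate_gemm_plan`.

```python
import math


def tile_counts(M, K, N, tile_m, tile_k, tile_n):
    """Ceiling tile counts per dimension, never less than one tile."""
    return (
        max(1, math.ceil(M / tile_m)),
        max(1, math.ceil(K / tile_k)),
        max(1, math.ceil(N / tile_n)),
    )


def component_ids(pe_prefix):
    """The four PE sub-component IDs used by the GEMM plan."""
    return {
        "dma_id": f"{pe_prefix}.pe_dma",
        "fetch_id": f"{pe_prefix}.pe_fetch_store",
        "gemm_id": f"{pe_prefix}.pe_gemm",
        "math_id": f"{pe_prefix}.pe_math",
    }
```

For example, a 100 × 64 × 33 problem with 32 / 64 / 32 tiles yields 4, 1, and 2 tiles along M, K, and N; the partial tiles at the edges still count as whole tiles.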
+## Context
+
+ADR-0014 D6 agreed that "PE_SCHEDULER, on receiving a CompositeCmd, generates
+a TilePlan and feeds self-routing tile tokens". But the **concrete plan
+generation algorithm** lives in `src/kernbench/components/builtin/tiling.py`,
+which:
+
+- Defines no component — it is a pair of **pure functions**
+  (`generate_gemm_plan`, `generate_math_plan`).
+- Does not depend on the SimPy environment, queues, op_log, or hooks.
+- Returns a `PipelinePlan` (dataclass).
+
+The original G4 analysis incorrectly described `tiling.py` as a component;
+it is in fact a plan-builder helper consumed by PE_SCHEDULER. Pinning this
+down in its own ADR (paired with ADR-0014 D6) prevents:
+
+- Ambiguity over whether plan generation belongs to PE_SCHEDULER or a
+  separate module.
+- Inconsistent rationale for stage sequences (e.g., FETCH/STORE position)
+  between GEMM and Math plans.
+- Undocumented branching rationale for `a_pinned` / `b_pinned` /
+  `epilogue_specs`.
+
+## Decision
+
+### D1. `tiling` is a pure plan-generator module, not a component
+
+`components/builtin/tiling.py` defines no `ComponentBase` subclass. It exports
+two module-level functions:
+
+- `generate_gemm_plan(...) -> PipelinePlan`
+- `generate_math_plan(...) -> PipelinePlan`
+
+There is no `tiling` node in the topology graph. It lives in `builtin/`
+because it is a direct helper for PE_SCHEDULER (ADR-0014 D6) and is
+conceptually a PE_SCHEDULER internal utility.
+
+### D2. GEMM plan stage sequence — `M → N → K` order
+
+For each `(m, n, k)` tile (default — no operand pinning, no epilogue):
+
+```
+[DMA_READ(A)] → [DMA_READ(B)] → FETCH → GEMM
+                                ↑
+                                ↓
+(last k tile only)              [MATH(output_tile)]* → STORE → DMA_WRITE
+```
+
+A `k_tile` epilogue inserts a MATH stage immediately after GEMM on every
+K-tile; an `output_tile` epilogue inserts MATH stages once per `(m, n)` after
+the final K-tile but before STORE/DMA_WRITE. The K-loop accumulator stays
+in the register file across K-tiles, so STORE/DMA_WRITE happens only when
+`last_k`.
+
+### D3. Operand pinning — `a_pinned` / `b_pinned`
+
+If a caller passes `a_pinned=True`, **the A DMA_READ is omitted from every
+(m, n, k) tile**. Semantically: the caller (e.g., `tl.composite`) has already
+staged all of A in TCM via a prior `tl.load`, and signals so to the plan
+generator.
+
+The branch is made at plan time (not at runtime). Therefore the stage record
+count in op_log changes deterministically with pinning, and sweep analyses
+(e.g., gemm_sweep's stage record count) see this decision directly.
+
+### D4. Epilogue scope — `k_tile` vs `output_tile`
+
+`epilogue_specs` is an iterable of op-spec objects. Each op object is
+expected to have:
+
+- `op.kind: str` — math op name (e.g., `"dequant"`, `"bias"`, `"relu"`,
+  `"scale"`). Placed into the stage's `params["op_kind"]`.
+- `op.scope: Scope` — `Scope.K_TILE` or `Scope.OUTPUT_TILE` (`Scope` enum
+  in `kernbench.common.pe_commands`).
+- Op-specific extras (e.g., `bias`, `scale`, `factor`) — currently not used
+  by the plan generator; consumed at runtime by PE_MATH.
+
+The plan generator partitions by `getattr(o, "scope", None)`:
+
+- `scope == Scope.K_TILE`: adds a MATH stage right after GEMM on every K-tile.
+- `scope == Scope.OUTPUT_TILE`: adds MATH stages just before STORE on the
+  last K-tile per `(m, n)`.
+
+Ops with neither `scope` value (e.g., missing attribute) are **dropped
+silently** — `getattr(..., None) == Scope.X` is False for both. Picking a
+default (`output_tile`) is the **caller's responsibility** (e.g.,
+`tl.composite`), not the plan generator's. This aligns with ADR-0014's
+composite epilogue contract.
+
+`Scope` is imported lazily inside the function to avoid the circular path
+`pe_commands ← pe_types ← tiling`. This is intentional and not a refactor
+target — keeping `tiling` free of compile-time `pe_commands` dependencies
+preserves the module boundary (D1).
+
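The branching in D2 through D4 can be sketched as a pure function that returns stage names per `(m, n, k)` tile. This is a simplified illustration, not the `tiling.py` code: stages are plain strings instead of `Stage` objects, the local `Scope` enum stands in for `kernbench.common.pe_commands.Scope`, and `b_pinned` is assumed to mirror `a_pinned`.

```python
from enum import Enum
from types import SimpleNamespace


class Scope(Enum):  # stand-in for the real Scope enum
    K_TILE = "k_tile"
    OUTPUT_TILE = "output_tile"


def gemm_stage_names(m_tiles, n_tiles, k_tiles,
                     a_pinned=False, b_pinned=False, epilogue_specs=()):
    """Return [((m, n, k), stage_names)] in M -> N -> K order."""
    specs = list(epilogue_specs)
    # Ops with a missing or unknown scope match neither filter and are dropped.
    k_ops = [o for o in specs if getattr(o, "scope", None) == Scope.K_TILE]
    out_ops = [o for o in specs if getattr(o, "scope", None) == Scope.OUTPUT_TILE]
    tiles = []
    for m in range(m_tiles):
        for n in range(n_tiles):
            for k in range(k_tiles):
                stages = []
                if not a_pinned:
                    stages.append("DMA_READ(A)")
                if not b_pinned:
                    stages.append("DMA_READ(B)")
                stages += ["FETCH", "GEMM"]
                stages += [f"MATH({o.kind})" for o in k_ops]
                if k == k_tiles - 1:  # last K-tile: flush the accumulator
                    stages += [f"MATH({o.kind})" for o in out_ops]
                    stages += ["STORE", "DMA_WRITE"]
                tiles.append(((m, n, k), tuple(stages)))
    return tiles


# Example epilogue list; the third op has no scope and is silently dropped.
EXAMPLE_SPECS = [
    SimpleNamespace(kind="dequant", scope=Scope.K_TILE),
    SimpleNamespace(kind="bias", scope=Scope.OUTPUT_TILE),
    SimpleNamespace(kind="relu"),
]
```

Because every branch is taken at plan time, the number of stages per tile is fixed by the arguments alone, which is what makes stage record counts comparable across sweep variants.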
+### D5. Math plan stage sequence — `M → N` order
+
+For each `(m, n)` tile:
+
+```
+DMA_READ → FETCH → MATH → STORE → DMA_WRITE
+```
+
+There is no K dimension, so concepts like epilogue or accumulator residency
+do not apply. PE_FETCH_STORE's register-file accounting follows the same
+pattern as the GEMM plan.
+
+### D6. Plans are data — no SimPy dependency
+
+`PipelinePlan` is a dataclass in `pe_types.py` holding `tiles:
+list[TilePlan]`. Each `TilePlan` holds `stages: tuple[Stage, ...]`. The plan
+itself is near-immutable (only `Stage.params: dict` is mutable) and holds no
+SimPy objects.
+
+At runtime, PE_SCHEDULER consumes the plan's first stage, builds a `TileToken`,
+and feeds it into the pipeline. The TileToken carries `plan: TilePlan`,
+`stage_idx: int`, and a cached `params: dict`. Self-routing proceeds by
+`TileToken.advance()` caching the next stage's `params` (ADR-0014 D6).
+
+### D7. Plan generator contract — pure, deterministic, idempotent
+
+Two calls with identical inputs return equal `PipelinePlan` values
+(including `TilePlan.stages` order). This contract aligns with ADR-0014 D6's
+"deterministic tile dispatch order".
+
+No side effects (no SimPy events, no file I/O, no global state) — tests can
+call the generators directly without an environment object (some cases in
+`tests/test_pe_pipeline.py` rely on this).
+
+## Alternatives Considered
+
+### A1. Make tiling a component (e.g., PE_PLANNER)
+
+Rejected. Plan generation consumes no SimPy time — it is a pure decision
+algorithm. Making it a component would (a) add unnecessary infrastructure
+(inbox, resources), and (b) split PE_SCHEDULER's flow into "receive plan"
+plus "feed tiles", inserting a meaningless hop.
+
+### A2. Move plan generation into PE_SCHEDULER as methods
+
+Rejected (currently). Module separation provides (1) testability and
+(2) extensibility for additional plan algorithms (e.g., DTensor-aware) —
+add a new function. If plan kinds proliferate enough to require explicit
+dispatch, a future ADR can introduce a plan factory on PE_SCHEDULER.
+
+### A3. Make plans fully immutable (frozen dataclass + tuple)
+
+Partially adopted. `Stage` and `TilePlan` are dataclasses but not frozen,
+because `Stage.params: dict` is populated at plan-generation time and read
+at runtime (cached by TileToken on advance). Moving dict → frozendict pays
+migration cost without enough benefit. Convention: do not mutate after
+generation.
+
+## Consequences
+
+- `tiling.py` is documented as a plan-generator module, not a component —
+  preempting future G4-style "this component lacks an ADR" analyses.
+- The GEMM plan's stage sequence (D2) and pinning / epilogue branching
+  (D3 / D4) are pinned, providing a clear interpretation basis for sweep
+  analyses (e.g., `scripts/gemm_sweep.py`'s stage record counts).
+- The plan generator's pure contract (D7) enables environment-free testing
+  in line with ADR-0013 (verification strategy).
+- Future plan kinds (DTensor-aware, K-major, ...) follow D1 / D6 / D7 as a
+  baseline — just add a new function.
@@ -0,0 +1,135 @@
+# ADR-0043: Allreduce Evaluation Harness — `tests/sccl/`
+
+## Status
+
+Accepted
+
+Documents the `tests/sccl/` evaluation harness; verified against the
+implementation (constants, file set, and sweep dimensions cross-checked).
+
+**Amended by ADR-0054**: the driver core, sweeps, and renderers moved into
+the `milestone-1h-ccl` bench (single home); `tests/sccl/_allreduce_helpers.py`
+now re-exports from it (keeping the pytest-only param builders +
+`_run_distributed` wrapper local). The figure tests are unchanged.
+
+## Context
+
+ADR-0032 defines the intercube all-reduce *algorithm*; ADR-0023/0024/0027
+define the IPCQ backend, the rank=SIP launcher, and `mp.spawn`. None of
+them describe **how the allreduce is exercised and characterized** — the
+correctness tests, the latency/buffer-kind sweeps, and the derived plots.
+ADR-0013 (verification strategy) is the general policy; this ADR pins the
+concrete allreduce harness so the "evaluation" half of the work is
+documented, not just the implementation.
+
+The harness lives under `tests/sccl/` (the package created when the
+allreduce tests were consolidated). It supersedes the earlier flat
+`tests/test_allreduce_multidevice.py` + `tests/test_distributed_*` layout.
+
+## Decision
+
+### D1. Drive evaluation through the public `torch.distributed` path
+
+Correctness and the sweeps run the collective through the real DDP-shaped
+path — `init_process_group(backend="ahbm") → mp.spawn → dist.all_reduce`
+(ADR-0024/0027) — not the lower-level `ctx.launch`. A shared helper
+`_run_distributed(tmp_path, monkeypatch, topo_path, corr_id, n_elem)` in
+`tests/sccl/_allreduce_helpers.py` builds the engine, runs the workers, and
+returns `(engine, n_cubes)`. `monkeypatch.chdir` points the backend's
+`load_ccl_config()` (cwd lookup) at a per-case temp `ccl.yaml`.
+
+A direct-launch reference (`run_allreduce`) is retained in the same helper
+module — not for the distributed tests, but because the IPCQ buffer-kind /
+root-center micro-tests under `tests/` import it.
+
+### D2. One file per evaluation concern
+
+| File | Concern | `torch.distributed`? |
+|---|---|---|
+| `test_allreduce_ring_torus_mesh.py` | correctness across ring_1d / torus_2d (2×3) / mesh_2d_no_wrap (2×3) | yes |
+| `test_distributed_default_topology.py` | full path on `topology.yaml` as-is | yes |
+| `test_plot_latency_sweep.py` | latency sweep rows (n_elem × topology) | yes |
+| `test_plot_buffer_kind_sweep.py` | TCM/SRAM/HBM sweep rows | yes |
+| `test_plot_topology_diagram.py` | topology.png (pure matplotlib) | no |
+| `test_plot_comparison_fsim.py` | broken-axis model-vs-FSIM comparison | no |
+| `test_intercube_root_center.py` | ADR-0032 center-root latency guard (direct path) | no |
+
+`_allreduce_helpers.py` holds the shared plumbing (driver, config writers,
+sweep/buffer-kind constants, plot aggregators, topology-diagram + FSIM
+comparison emitters). It is not collected (no `test_` prefix).
+
+### D3. Latency metric — critical-path `pe_exec_ns`
+
+The reported latency per config is `crit_ns = max(pe_exec_ns)` over
+`engine._results` — the slowest rank's PE execution time. This is the
+number plotted on every latency chart and recorded in `summary.csv`.
+
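A minimal sketch of the D3 metric follows. The shape of `engine._results` is an assumption here: each entry is treated as a mapping with a `pe_exec_ns` key, which may not match the real result objects.

```python
def crit_ns(results):
    """Critical-path latency: the slowest rank's PE execution time."""
    return max(r["pe_exec_ns"] for r in results)
```

Taking the maximum rather than the mean reflects that an all-reduce is not complete until the last rank finishes.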
+### D4. Sweep dimensions
+
+- **Latency sweep**: `n_elem ∈ {8, 32, 64, 128, 512, 1024, 2048, 4096,
+  8192, 16384, 32768, 49152}` (16 excluded — collides with `n_cubes`) ×
+  topology ∈ {ring_1d (6), torus_2d 2×3 (6), mesh_2d_no_wrap 2×3 (6)}.
+- **Buffer-kind sweep**: `buffer_kind ∈ {tcm, sram, hbm}` × a smaller
+  `n_elem` grid, on torus_2d 6-SIP (3×2). buffer_kind is set in the temp
+  `ccl.yaml` (read by the backend at `init_process_group`, ADR-0023 D6).
+
+The 2×3 / 3×2 grids exercise the explicit-`w/h` SIP resolution
+(ADR-0024 D5).
+
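The latency-sweep grid above can be enumerated as a plain Cartesian product. The constant and function names below are hypothetical; the values are the ones listed in D4.

```python
from itertools import product

# n_elem grid from D4; 16 is deliberately absent.
N_ELEMS = (8, 32, 64, 128, 512, 1024, 2048, 4096, 8192, 16384, 32768, 49152)
TOPOLOGIES = ("ring_1d", "torus_2d", "mesh_2d_no_wrap")


def latency_sweep_cases():
    """One dict per (topology, n_elem) case, topology-major order."""
    return [{"topology": t, "n_elem": n} for t, n in product(TOPOLOGIES, N_ELEMS)]
```

Twelve sizes across three topologies give 36 parametrized cases.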
+### D5. Derived plots via `pytest_sessionfinish` aggregators
+
+Sweep tests are xdist-friendly: each parametrized case writes one JSON row
+to a staging dir. The conftest `pytest_sessionfinish` hook (controller node
+only) calls the aggregators in `_allreduce_helpers.py`:
+
+- `_aggregate_sweep_plots()` → per-topology PNGs + `summary.csv`
+- `aggregate_buffer_kind_plot()` → the TCM/SRAM/HBM comparison PNG + csv
+
+The topology-diagram and FSIM-comparison figures are emitted directly by
+their own `test_plot_*` tests (no row staging — they are pure functions of
+`topology.yaml` and `summary.csv` respectively). All outputs land in
+`docs/diagrams/allreduce_latency_plots/` and are **derived artifacts** per
+CLAUDE.md (consistent-with-ADRs, no Phase-2 gate).
+
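The row-staging pattern in D5 can be sketched with the standard library alone. This is an assumed shape, not the `_allreduce_helpers.py` code: the function names, the row fields, and the latency values in `demo()` are placeholders chosen for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path

FIELDS = ["topology", "n_elem", "crit_ns"]


def write_row(staging_dir, case_id, row):
    """Called once per parametrized case: stage one JSON row."""
    staging_dir = Path(staging_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)
    (staging_dir / f"{case_id}.json").write_text(json.dumps(row))


def aggregate_rows(staging_dir, csv_path):
    """Called once at session end: merge staged rows into one CSV."""
    rows = [json.loads(p.read_text()) for p in Path(staging_dir).glob("*.json")]
    rows.sort(key=lambda r: (r["topology"], r["n_elem"]))
    with open(csv_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows


def demo():
    """Stage two placeholder rows out of order, then aggregate them."""
    with tempfile.TemporaryDirectory() as tmp:
        staging = Path(tmp) / "rows"
        write_row(staging, "ring_1d-32", {"topology": "ring_1d", "n_elem": 32, "crit_ns": 2.0})
        write_row(staging, "ring_1d-8", {"topology": "ring_1d", "n_elem": 8, "crit_ns": 1.0})
        rows = aggregate_rows(staging, Path(tmp) / "summary.csv")
        header = (Path(tmp) / "summary.csv").read_text().splitlines()[0]
    return [r["n_elem"] for r in rows], header
```

One file per case keeps xdist workers from contending on a shared output, and sorting at aggregation time makes the result independent of worker scheduling.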
+### D6. The FSIM comparison reference is a hardcoded constant
+
+`emit_comparison_fsim_plot()` overlays the model curves against a single
+external FSIM single-device reference (`366 µs`), held as a literal — there
+is no external data file. The "measured" series comes from the simulator
+(`op_log` GEMM count, `composite_window_ns`); the "theoretical" series is a
+hand-derived analytical model (the same one ADR-0044 D5 flags as
+ADR-unverified).
+
+## Consequences
+
+### Positive
+
+- The allreduce is evaluated through the same API a real DDP script uses,
+  so the harness doubles as an integration test of ADR-0024/0027.
+- Figures regenerate on every `pytest` run from committed data; no manual
+  plot step.
+- Rectangular-grid sweeps gave the regression coverage that surfaced the
+  ADR-0024 D5 `w/h` fix.
+
+### Negative / limitations
+
+- The full latency sweep runs in the default `pytest` (~minutes); it is not
+  marked `slow`. (Contrast ADR-0044, where the GEMM sweep is `slow`.)
+- `test_intercube_root_center.py` carries a latency *threshold* assertion
+  (ADR-0032 center-root guard) — the only absolute-latency assertion in the
+  suite; it is sensitive to latency-model changes (ADR-0033).
+
+## Dependencies
+
+- **ADR-0013**: verification strategy (general policy this specializes).
+- **ADR-0023 / ADR-0024 / ADR-0027**: IPCQ backend, rank=SIP launcher,
+  `mp.spawn` — the path D1 drives.
+- **ADR-0032**: the algorithm under evaluation; D4 grids exercise its
+  topology branches.
+- **ADR-0044**: the sibling GEMM evaluation harness.
+
+## Open questions
+
+- Should the latency sweep be marked `slow` for parity with the GEMM sweep?
+- Should the FSIM reference move from a hardcoded constant to a versioned
+  data file?
@@ -0,0 +1,136 @@
+# ADR-0044: GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`
+
+## Status
+
+Accepted
+
+Documents the GEMM evaluation/characterization harness; verified against the
+implementation (constants, tile sizes, figure set, and the script↔test
+split cross-checked). The D5/D6 caveats are recorded limitations, not
+inaccuracies.
+
+**Amended by ADR-0054**: the sweep + renderers moved into the
+`milestone-1h-gemm` bench (single home); `scripts/gemm_sweep.py` and
+`tests/gemm/` now re-export from it. D1/D2's "data generation stays a manual
+script / heavy work is opt-in" is superseded by the eval-bench pattern (one
+bench regenerates everything; `MILESTONE_FAST=1` reuses the committed JSON).
+
+## Context
+
+ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generators) define the GEMM
+*implementation*; ADR-0033 defines the latency model. None of them describe
+**how GEMM performance is swept and characterized** — the shape/variant
+sweep that produces the timing data, and the figures that interpret it.
+This ADR pins that harness.
+
+Unlike the allreduce harness (ADR-0043), the GEMM sweep is **heavy** (24
+sim runs: 8 shapes × 3 operand-staging variants; the `512` shape alone is
+2048 tiles). That weight drives the split below.
+
+## Decision
+
+### D1. Two-layer split — heavy data generation (script) vs. fast figures (tests)
+
+- **Data generation stays a manual script**: `scripts/gemm_sweep.py` runs
+  `matmul-composite` (ADR-0042 plans) across shapes × variants via the same
+  `run_bench` path the CLI uses, harvests `result.engine.op_log`, and
+  writes `docs/diagrams/gemm_sweep.json` (per-stage / per-engine wall-clock
+  + occupancy + record counts + pe/composite windows).
+- **Figure rendering is test-generated**: `tests/gemm/` reads the committed
+  `gemm_sweep.json` and renders matplotlib PNGs into
+  `docs/diagrams/gemm_plots/`. These tests are fast and run by default.
+
+Rationale: a slide-deck-scale sim sweep does not belong in every `pytest`
+run, but the figures (cheap, deterministic) should regenerate freely and be
+guarded by CI. This mirrors CLAUDE.md's script-vs-test split (scripts for
+heavy/manual generation; tests for fast assertions).
+
+### D2. Slow regenerator test wraps the script
+
+`tests/gemm/test_gemm_sweep.py` is marked `@pytest.mark.slow` (excluded by
+the default `addopts: -m "not slow"`). It invokes `scripts/gemm_sweep.py`
+via subprocess to regenerate `gemm_sweep.json` on demand
+(`pytest -m slow tests/gemm/test_gemm_sweep.py`). The sweep logic has a
+single home (the script); the test only wraps it, so there is no duplicated
+sim-driving code.
+
+### D3. Figure set (3 charts, `load_ref` variant)
+
+| Test | PNG | Content |
+|---|---|---|
+| `test_plot_gemm_stage_breakdown.py` | `gemm_stage_breakdown.png` | per-stage engine wall-clock (DMA in / Fetch / GEMM / DMA out) |
+| `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_measured.png` | GEMM util % + useful eff % |
+| `test_plot_gemm_mac_utilization.py` | `gemm_mac_utilization_theoretical_vs_measured.png` | theoretical vs simulator-measured util/eff |
+
+`tests/gemm/_gemm_plot_helpers.py` holds the shared renderers (series logic
+mirrors the GEMM `_render_*` functions in `scripts/build_overview_slides.py`,
+which still draws these natively in the PPTX). Not collected (no `test_`
+prefix). Each `test_plot_*` skips if `gemm_sweep.json` is absent.
+
+### D4. Tile sizes are data-driven; under-tile shapes are flagged
+
+Tile sizes are read from `gemm_sweep.json` (`tile_sizes`), which the sweep
+records from `PeSchedulerComponent.TILE_M/K/N = 32/64/32` — the authoritative
+source. Shapes with `M<TILE_M ∨ K<TILE_K ∨ N<TILE_N` are flagged
+("under-tile") on the charts. The `512³` shape is excluded from the figures
+(`EXCLUDED_SHAPES`).
+
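The under-tile predicate and the tile count quoted in Context can be checked with a short sketch. The function names are hypothetical; the tile sizes are the `32/64/32` values stated above, and the `max(1, ceil(...))` rule follows ADR-0042.

```python
import math

TILE_M, TILE_K, TILE_N = 32, 64, 32


def is_under_tile(M, K, N, tile_m=TILE_M, tile_k=TILE_K, tile_n=TILE_N):
    """True when any dimension is smaller than its tile size."""
    return M < tile_m or K < tile_k or N < tile_n


def total_tiles(M, K, N, tile_m=TILE_M, tile_k=TILE_K, tile_n=TILE_N):
    """Number of (m, n, k) tiles, with at least one tile per dimension."""
    return (
        max(1, math.ceil(M / tile_m))
        * max(1, math.ceil(K / tile_k))
        * max(1, math.ceil(N / tile_n))
    )
```

For the 512 × 512 × 512 shape this gives 16 × 8 × 16 = 2048 tiles, matching the figure in Context.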
+### D5. Theoretical model — inherited constants, NOT yet ADR-verified
+
+The "theoretical" curves use an analytical ideal-pipeline model with
+constants copied verbatim from `scripts/build_overview_slides.py`:
+
+```
+HBM_GBS  = 256.0   # GB/s
+T_STAGE  = 16.0    # ns
+D_STAGES = 3
+BPE      = 2
+```
+
+**These are not yet sourced against the ADRs.** Notably ADR-0033's `256`
+is `burst_bytes` (256 B), a *different* quantity than this `256 GB/s`, and
+ADR-0033 derives bandwidth as `pc_bw_gbs = hbm_to_router_bw_gbs / num_pcs`.
+`T_STAGE`/stage-count are not traced to ADR-0014 here. The model is
+therefore **consistent with the existing deck script, not verified against
+the ADRs**, and the constants are duplicated (deck + helper). Reconciling
+them (source from topology/ADR-0033/0014, de-duplicate) is deferred — see
+Open questions.
+
+### D6. Known naming caveat — `_measured` chart
+
+`gemm_mac_utilization_measured.png` currently plots the *theoretical*
+ideal-pipeline numbers (its footnote says so); only the filename says
+"measured". This is a known misnomer pending a decision to either repoint
+its content to the simulator-measured series or retitle it.
+
+## Consequences
+
+### Positive
+
+- GEMM figures are test-generated and CI-guarded, like allreduce.
+- The heavy sweep stays opt-in, keeping the default test run fast.
+- Single source for the sweep logic (the script), reused by the slow test.
+
+### Negative / limitations
+
+- The theoretical-model constants (D5) are unverified and duplicated.
+- The `_measured` figure is a misnomer (D6).
+- `build_overview_slides.py` still renders the GEMM bars natively from
+  `gemm_sweep.json` rather than embedding these PNGs — the deck rewiring to
+  consume the test artifacts is not done.
+
+## Dependencies
+
+- **ADR-0013**: verification strategy.
+- **ADR-0014 / ADR-0042**: PE pipeline + tile-plan generators — the GEMM
+  implementation the sweep measures; D1's stage record counts come from
+  ADR-0042 D2/D3.
+- **ADR-0033**: latency model — the source the D5 constants should (but do
+  not yet) trace to.
+- **ADR-0043**: the sibling allreduce evaluation harness.
+
+## Open questions
+
+- Reconcile D5 constants against `topology.yaml` / ADR-0033 / ADR-0014 and
+  de-duplicate (one source for the model parameters)?
+- Resolve the D6 `_measured` naming (repoint content vs. retitle)?
+- Rewire `build_overview_slides.py` to embed the `gemm_plots/` PNGs instead
+  of native bar-drawing?
@@ -0,0 +1,295 @@
+# ADR-0045: Bench Module Contract — registration, dispatch, and authoring
+
+## Status
+
+Accepted (2026-05-21).
+
+Unifies the `src/kernbench/benches/` registration mechanism (@bench), the
+CLI dispatch path (`kernbench run/list`), and the contract a new bench
+module must follow. ADR-0010 (CLI surface) specifies the `kernbench
+list/run` interface, but **how benches are registered and what signature
+they must follow** had no ADR-level coverage.
+
+**Extended by ADR-0054**: D5's single-config rule gains a third pattern —
+the *eval bench* (e.g. `milestone-1h-*`) drives many configs, builds its
+own per-config engines, and submits a sentinel tensor to satisfy D4.
+
+## First action
+
+When `kernbench.benches` is imported, `__init__.py` immediately calls
+`_eager_import_and_audit(__path__, __name__)`. Its first action is to
+enumerate every sibling module in the package directory via
+`pkgutil.iter_modules(__path__)` and **eagerly import** each one via
+`importlib.import_module(...)` — except modules matching either:
+
+- name `registry` (the infrastructure module itself), or
+- name starting with `_` (helper modules).
+
+At import time, each `@bench(name=..., description=...)` decorator inside
+the imported module runs, appending `(name, description, fn)` to
+`_PENDING` and adding `fn.__module__` to `_REGISTERED_MODULES`.
+
+Once imports finish, `_audit_modules(imported, _REGISTERED_MODULES)`
+runs; if any imported module did not invoke `@bench` at least once, it
+raises `RuntimeError("Bench module(s) missing @bench decorator: ...")`
+immediately. At this point indices are still unassigned — the first call
+to `list_all()` / `resolve(...)` triggers `_finalize()`, which sorts
+`_PENDING` alphabetically by name and assigns 1-based indices.
+
+In short, **the bench infrastructure's first act is "eagerly import
+every non-helper module in the package and audit that each one
+registered at least one bench"**.
+
+## Context
+
+`src/kernbench/benches/` currently holds 8 bench modules (`ccl_allreduce`,
+`gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`,
+`qkv_gemm`, `qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows
+the same unified flow:
+
+```
+kernbench run --topology <T> --bench <N>
+   ↓
+cli/main.py::cmd_run
+   ↓  resolve_topology(T)  + resolve(N)  + resolve_device(device_arg)
+   ↓
+runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
+   ↓  engine_factory(topology, device) → GraphEngine
+   ↓  RuntimeContext(engine, target_device, correlation_id, spec)
+   ↓
+bench_fn(ctx)        ← invokes the bench's run(torch)
+   ↓  ctx.empty/zeros/from_numpy/launch/distributed.* etc. submit work
+   ↓
+ctx.wait_all()       ← drains any outstanding handles
+   ↓
+BenchResult(completion, correlation_id, trace, traces, engine)
+```
+
+ADR-0010 covers only the CLI surface (`run/list/probe/web`); ADR-0007
+covers only the runtime API ↔ sim_engine boundary. The question "what
+shape must a new bench file take?" had to be answered by grepping the
+codebase. As a result:
+
+- The @bench decorator contract (kebab-case name, non-empty description)
+  lived only in the source.
+- The bench function signature (`def run(torch)`) was a de-facto
+  convention enforced by the CLI dispatcher calling `spec.run`.
+- New bench authors learned the "helpers must use `_` prefix" rule only
+  after seeing the audit's RuntimeError.
+- The single-device convention (CLAUDE.md Part 2 CLI Semantics) and its
+  interaction with multi-SIP CCL benches was ambiguous for bench
+  authors.
+
+This ADR consolidates all of it in one place.
+
+## Decision
+
+### D1. @bench decorator contract
+
+```python
+from kernbench.benches.registry import bench
+
+@bench(name="my-bench", description="Short, complete-sentence description.")
+def run(torch):
+    ...
+```
+
+- `name`: kebab-case string matching `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`.
+  Lowercase letters, digits, and dashes only; underscores forbidden;
+  must start with a letter.
+- `description`: non-empty string (stripped length > 0). Displayed
+  verbatim by `kernbench list`.
+- The decorator **returns the function unchanged** — direct invocation
+  is fine. Its only side effect is appending to `_PENDING`.
+
+Violations of the first two rules raise `ValueError` at decoration time.
+Duplicate names are caught at `_finalize()` with
+`RuntimeError("duplicate bench name: ...")`.
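+
+The D1 rules can be sketched as follows. This is a minimal stand-alone
+illustration, not the contents of `registry.py`: the `_NAME_RE` helper,
+the error-message wording, and the plain-list `_PENDING` are assumptions
+made for the example.
+
+```python
+import re
+
+# Hypothetical stand-ins for the registry's module-level state.
+_NAME_RE = re.compile(r"[a-z][a-z0-9]*(-[a-z0-9]+)*")
+_PENDING = []
+
+
+def bench(name, description):
+    """Validate both arguments, record the function, return it unchanged."""
+    if not isinstance(name, str) or _NAME_RE.fullmatch(name) is None:
+        raise ValueError(f"bench name must be kebab-case, got {name!r}")
+    if not isinstance(description, str) or not description.strip():
+        raise ValueError("bench description must be a non-empty string")
+
+    def decorator(fn):
+        # The only side effect: remember the bench for later finalization.
+        _PENDING.append((name, description, fn))
+        return fn
+
+    return decorator
+
+
+@bench(name="my-bench", description="Short, complete-sentence description.")
+def run(torch):
+    return "ran"
+```
+
+Because the decorator returns `run` unchanged, a unit test can still call
+`run(ctx)` directly.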
+
+### D2. Module-file convention
+
+Every `src/kernbench/benches/<slug>.py` must be one of:
+
+- **A bench module**: at top-level import, `@bench(...)` runs at least
+  once to register at least one bench.
+- **A helper module**: the filename starts with `_` (e.g.,
+  `_shared_helpers.py`). The eager-import pass skips it.
+
+The audit (`_audit_modules`) rejects any non-helper that fails to call
+`@bench`. Intended consequence: dropping a new file into `benches/`
+automatically registers its benches, and helper modules are clearly
+flagged by their filename prefix alone.
+
+### D3. The bench function signature is `def run(torch)`
+
+The decorator does not enforce a function name, but **CLI dispatch calls
+`spec_entry.run`** (the decorated callable). The convention is therefore:
+
+- Function name: `run`. Other names work, but always use `run` for
+  readability and grep-ability.
+- Argument: a single positional `torch`. In practice this is a
+  `RuntimeContext` instance exposing PyTorch-style namespaces
+  (zeros/empty/launch/distributed/...) — see ADR-0024 D3.
+- Return value: any (`Any`). `run_bench` ignores it and tracks
+  completion via `ctx.handles()` / `engine.get_completion()`.
+
+The `torch` name imitates a PyTorch-compatible idiom; the actual PyTorch
+module is not passed in (aligned with ADR-0024's "rank = SIP" launcher
+convention).
+
+### D4. A bench must submit at least once
+
+If `ctx.handles()` is empty after the bench returns, `run_bench` reports
+a `BenchResult.completion` with `ok=False` and
+`error_code="NO_REQUESTS"`. So a meaningful bench must invoke at least
+one of:
+
+- Tensor-creation APIs: `torch.zeros(...)`, `torch.empty(...)`. These
+  internally submit `MmuMapMsg` and (for zeros) `MemoryWriteMsg`.
+- Kernel-launch API: `torch.launch(name, fn, *args)`, which submits
+  per-SIP `KernelLaunchMsg`.
+
+Exception: an empty placeholder bench, e.g. `ipcq_allreduce.py`'s
+`print(...)`-only stub, receives a NO_REQUESTS result. CI is expected to
+recognize and handle placeholder benches specially.
+
+### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)
+
+CLAUDE.md Part 2 CLI Semantics' **"benchmarks MUST remain
+single-device"** rule is interpreted as follows:
+
+- **Standard bench (single-SIP use)**: define tensor placement with
+  `dp = DPPolicy(...)` and launch with `torch.launch(...)`. The SIP
+  index is chosen by `--device` (CLI's responsibility).
+- **CCL bench (multi-SIP use)**: as an exception, use
+  `torch.distributed.init_process_group(backend="ahbm")` plus
+  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` for the
+  rank = SIP pattern (ADR-0024 D3). `--device` is ignored (or treated
+  as `all`); each spawned worker calls `torch.ahbm.set_device(rank)` to
+  bind to its SIP.
+
+Multi-device patterns outside these two (e.g., one bench function
+launching across multiple SIPs in the same process) are forbidden by
+this ADR. Even with `--device all`, the bench runs once; multi-SIP use
+inside that single run must follow D5's second pattern.
+
+### D6. Name/index resolution (`resolve`)
+
+`resolve(identifier: str)` returns a BenchSpec via:
+
+1. If `identifier.isdigit()`: convert to int and find the spec where
+   `index ==` that value. If none, `ValueError("No bench with index
+   ...")`.
+2. If `identifier in _REGISTRY`: direct lookup.
+3. Otherwise: `ValueError("Unknown bench ...")`.
+
+Empty or whitespace-only identifiers raise `ValueError("bench
+identifier must be a non-empty string.")`.
+
+The CLI passes `--bench` directly to `resolve`, so users can use either
+`kernbench run --bench gemm-single-pe` or `kernbench run --bench 2`.
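+
+The lookup order above can be sketched as below. This is an illustrative
+sketch only: the two-field `BenchSpec`, the `finalize` helper, and the
+explicit `registry` argument are assumptions for the example (the real
+`resolve` takes just the identifier and reads module-level state).
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class BenchSpec:
+    index: int
+    name: str
+
+
+def finalize(names):
+    """Sort names alphabetically and assign 1-based indices."""
+    if len(set(names)) != len(names):
+        raise RuntimeError("duplicate bench name")
+    return {n: BenchSpec(i, n) for i, n in enumerate(sorted(names), start=1)}
+
+
+def resolve(registry, identifier):
+    if not isinstance(identifier, str) or not identifier.strip():
+        raise ValueError("bench identifier must be a non-empty string.")
+    if identifier.isdigit():
+        # Rule 1: an all-digit identifier is treated as an index.
+        wanted = int(identifier)
+        for spec in registry.values():
+            if spec.index == wanted:
+                return spec
+        raise ValueError(f"No bench with index {wanted}")
+    if identifier in registry:
+        # Rule 2: otherwise try a direct name lookup.
+        return registry[identifier]
+    # Rule 3: nothing matched.
+    raise ValueError(f"Unknown bench {identifier!r}")
+```
+
+One consequence of rule 1 is that an all-digit string is never tried as a
+name; D1's "must start with a letter" rule means no bench name can be
+all digits anyway.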
+
+### D7. Indices are not a stable API
+
+`_finalize()` sorts `_PENDING` alphabetically by name and assigns
+1-based indices. Adding a new bench can shift existing benches'
+indices. Therefore:
+
+- Human-interactive use: indices are fine.
+- Scripts / CI automation: always use the name.
+
+This caveat is documented in `registry.py`'s module docstring.
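+
+A small worked example of the index shift, using hypothetical bench
+names (`bench-a`, `bench-b`, `bench-c` are placeholders, not registered
+benches) and a stand-in `assign_indices` helper:
+
+```python
+def assign_indices(names):
+    """Alphabetical sort, then 1-based numbering (the D7 rule)."""
+    return {name: i for i, name in enumerate(sorted(names), start=1)}
+
+
+before = assign_indices(["bench-b", "bench-c"])
+after = assign_indices(["bench-b", "bench-c", "bench-a"])
+
+print(before)  # {'bench-b': 1, 'bench-c': 2}
+print(after)   # {'bench-a': 1, 'bench-b': 2, 'bench-c': 3}
+```
+
+Adding `bench-a` moves `bench-b` from index 1 to index 2, which is why
+automation should pass the name.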
+
+### D8. Surface RuntimeContext exposes to benches
+
+A bench's `torch` parameter may legitimately use:
+
+- **Tensor creation**: `torch.empty(shape, dtype=..., dp=DPPolicy(...),
+  name=...)`, `torch.zeros(...)`, `torch.from_numpy(arr)`. All submit
+  host-side metadata plus device deployment (`MmuMapMsg` +
+  `MemoryWriteMsg`).
+- **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)` —
+  converts `(Tensor, int, float)` positional args to `TensorArg` /
+  `ScalarArg`, submits per-SIP `KernelLaunchMsg`, and drains.
+- **Synchronization**: `torch.wait(handle)`, `torch.wait_all()`
+  (`run_bench` calls the latter automatically).
+- **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`,
+  `torch.distributed.get_world_size()`,
+  `torch.distributed.all_reduce(t, op=...)` (ADR-0024/0027).
+- **Multi-process (rank = SIP)**:
+  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` (ADR-0024 D3 /
+  ADR-0027).
+- **Device binding**: `torch.ahbm.set_device(rank)` or
+  `torch.accelerator.set_device_index(rank)` (both point to the same
+  namespace).
+- **IPCQ install**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)`
+  (ADR-0023 D10).
+- **Spec lookup**: `torch.spec` — the dict produced by the topology
+  builder (system / cube_mesh / HBM parameters etc.). Use it so the
+  bench does not hardcode topology.yaml values.
+
+Benches must not access RuntimeContext private members (`_handles`,
+`_traces`, `_allocators`, etc.) directly. This aligns with ADR-0007's
+layer-boundary spirit: bench → runtime API → sim_engine flows in one
+direction.
+
+### D9. Environment-variable parameterization is allowed
+
+Benches may parameterize themselves via `os.environ.get(...)`, as
+`matmul_composite.py` does for `MATMUL_M`, `MATMUL_K`, `MATMUL_N`,
+`MATMUL_DTYPE`, `MATMUL_VARIANT`. Rationale:
+
+- The bench function signature is fixed by D3 to `def run(torch)`, so
+  positional/keyword arguments cannot carry parameters.
+- The env-var pattern is a natural hook for operational sweeps (e.g.,
+  `MATMUL_VARIANT`).
+- External drivers such as `scripts/gemm_sweep.py` (ADR-0044) consume
+  this hook (it sets `MATMUL_M/K/N/VARIANT` at
+  `scripts/gemm_sweep.py:115-118`).
+
+When environment variables alter bench behavior, the module docstring
+must list every variable used (`matmul_composite.py` is the canonical
+example).
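+
+A minimal sketch of the env-var hook follows. The variable names come
+from the list above; the `read_matmul_params` helper, the default values,
+and the injectable `environ` argument are assumptions for illustration
+and are not taken from `matmul_composite.py`.
+
+```python
+import os
+
+
+def read_matmul_params(environ=None):
+    """Read the MATMUL_* knobs; the defaults here are illustrative only."""
+    env = os.environ if environ is None else environ
+    return {
+        "m": int(env.get("MATMUL_M", "128")),
+        "k": int(env.get("MATMUL_K", "128")),
+        "n": int(env.get("MATMUL_N", "128")),
+        "dtype": env.get("MATMUL_DTYPE", "f16"),
+        "variant": env.get("MATMUL_VARIANT", "default"),
+    }
+```
+
+A driver script would set the variables before invoking the bench, and
+the bench would call a helper like this at the top of `run(torch)`.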
+
+## Alternatives Considered
+
+### A1. An explicit manifest file (YAML) listing benches
+
+Rejected. The `@bench` + audit pattern guarantees "drop in file → auto-
+register", concentrating cognitive cost in one place (the file itself).
+A separate manifest is prone to drift, and helper separation is already
+clear via the `_` prefix.
+
+### A2. Allowing the bench's entry-point name in the decorator
+(`@bench(name=..., entry="run_xxx")`)
+
+Rejected. Breaks the simplicity of dispatch (`spec.run` is a single
+callable). The `run` convention is sufficient; variants can register
+multiple `@bench`-decorated functions in the same module.
+
+### A3. A separate `@multi_device_bench` decorator for CCL
+
+Rejected. The two patterns named in D5 (single + ADR-0024 multi-SIP)
+cover all 8 current benches. A separate decorator would force dispatch
+to branch and add complexity; the multi-SIP intent is already obvious
+from the bench's `init_process_group(...)` call.
+
+### A4. Make indices a stable API (registration order or explicit
+`index=` argument)
+
+Rejected. D7's trade-off favors user-friendliness — alphabetically
+sorted 1-based indices read naturally in the `list` output. Scripts can
+use names.
+
+## Consequences
+
+- "How to add a bench" is consolidated in one ADR — new authors only
+  need to read D1-D3 and D8 without grepping source.
+- The `_`-prefixed helper-module pattern is legitimized at ADR level,
+  so future `benches/_*.py` shared helpers can be added freely.
+- The CLI's single-device convention and CCL's multi-SIP exception are
+  shown to be consistent (D5) — they are orthogonal.
+- The rationale for ADR-0044's GEMM eval harness using env-var hooks
+  (D9) is now ADR-pinned.
+- Indices are explicitly unstable (D7), so any CI code calling
+  `kernbench run --bench 3` is flagged for review after this ADR is
+  accepted.
@@ -0,0 +1,327 @@
+# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
+
+## Status
+
+Accepted (2026-05-22).
+
+Documents the set of `tl.*` primitives exposed by
+`src/kernbench/triton_emu/`'s `TLContext`, their semantics, and the two
+execution-mode contracts (command-list / greenlet runner). ADR-0014/0020
+define the PE pipeline and the 2-pass execution model, but **the `tl.*`
+surface that bench kernel functions call** had no ADR-level coverage.
+
+## First action
+
+When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id,
+num_cubes, scratch_base, scratch_size)` is instantiated, the first action
+is to initialize six categories of state:
+
+- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes` —
+  values that `tl.program_id` / `tl.num_programs` will return.
+- `self._dispatch_cycles` — cycle count emitted as `PeCpuOverheadCmd(cycles)`
+  at the start of every non-metadata `tl.*` API call (D6).
+- `self._runner` — `KernelRunner` instance (present → greenlet mode;
+  absent → command-list mode).
+- `self._commands: list[PeCommand] = []` — command-list accumulator
+  (command-list mode only).
+- `self._handle_counter = 0`, `self._completion_counter = 0` — counters
+  for generating TensorHandle / CompletionHandle ids.
+- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0` —
+  PE-local scratch region (used for math/dot/composite output handle
+  addresses).
+
+In short, **TLContext's first act is "record where (sip/cube/pe) and at
+what scale (num_programs/num_cubes) this kernel instance runs, and pick
+its dispatch mode (runner present or not)"**. No SimPy event is created
+and no command is emitted at this moment.
+
+The runtime first action happens when the kernel function first calls a
+`tl.<api>()`. The standard entry for every command-emitting `tl.*` API is:
+
+1. Call `self._emit_dispatch_overhead()` — if `dispatch_cycles > 0`,
+   immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
+2. Per-API processing (TensorHandle creation, command construction).
+3. `self._emit(cmd)` — in runner mode this `greenlet.switch()`es the cmd
+   to SimPy; in command-list mode it appends to `self._commands`.
+
+## Context
+
+The `tl.*` surface consists of `TLContext`'s methods, and the `tl`
+parameter received by a kernel function is one of these objects. The
+contract the user (bench author) sees:
+
+- Which primitives exist.
+- What data flow each primitive triggers (DMA / compute / IPCQ /
+  metadata-only).
+- How a TensorHandle's `space` and `addr` are decided.
+- The difference between command-list and greenlet modes.
+
+ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER,
+but how `tl.*` emits them is a code-only convention. ADR-0020 (2-pass
+data execution) mentions greenlet mode in D3 but does not pin down the
+signature difference (return-value handling) between the runner /
+non-runner paths. This ADR fills the gap.
+
+## Decision
+
+### D1. The `tl` parameter is a `TLContext` instance
+
+A bench kernel function has the signature:
+
+```python
+def _kernel(arg1, arg2, ..., tl, **kwargs):
+    ...
+```
+
+`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name
+imitates real Triton's `triton.language` module; the actual Triton
+module is **not** passed in.
+
+The kernel is plain Python — no `yield` or `async`. `tl.*` calls produce
+SimPy events, but to the caller they appear synchronous because in
+greenlet mode the KernelRunner relays between SimPy and the kernel
+(ADR-0020 D3).
+
+### D2. Two execution modes — command-list / greenlet runner
+
+- **Command-list mode (`runner is None`)**: `tl.*` calls append PeCommand
+  to `self._commands`. DMA / GEMM / Math consume no SimPy time and return
+  metadata-only TensorHandles (`data=None`). PE_SCHEDULER / sim_engine
+  later replays the command sequence in time.
+
+- **Greenlet runner mode (`runner is not None`)**: `tl.*` calls
+  `self._emit(cmd)` → `runner.switch_to_simpy(cmd)`, handing control to
+  the parent greenlet (SimPy). The parent distributes the cmd to
+  components, consumes SimPy time, and (for DMA reads) returns real numpy
+  data. The kernel receives the result and continues to the next line
+  (the data-aware execution model from ADR-0020 D3).
+
+The choice of mode is decided by whether a KernelRunner is injected into
+the TLContext. The `tl.*` methods themselves are mode-blind — they go
+through `_emit()` uniformly.
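+
+The mode-blind `_emit()` idea can be illustrated with a toy model. This
+is a sketch under stated assumptions, not the real `TLContext`: commands
+are plain tuples, `MiniContext` and `EchoRunner` are hypothetical names,
+and the runner is an ordinary object rather than a greenlet-backed
+`KernelRunner`.
+
+```python
+class MiniContext:
+    """Toy model of the two dispatch modes; not the real TLContext."""
+
+    def __init__(self, runner=None, dispatch_cycles=1):
+        self._runner = runner
+        self._dispatch_cycles = dispatch_cycles
+        self._commands = []
+
+    def _emit(self, cmd):
+        if self._runner is not None:
+            # Runner mode: hand the command to the simulator side and
+            # return whatever it sends back (e.g. loaded data).
+            return self._runner.switch_to_simpy(cmd)
+        # Command-list mode: record the command for later replay.
+        self._commands.append(cmd)
+        return None
+
+    def _emit_dispatch_overhead(self):
+        if self._dispatch_cycles > 0:
+            self._emit(("PeCpuOverheadCmd", self._dispatch_cycles))
+
+    def load(self, ptr, shape):
+        # The primitive never checks the mode; it only calls _emit().
+        self._emit_dispatch_overhead()
+        return self._emit(("DmaReadCmd", ptr, shape))
+
+
+class EchoRunner:
+    """Stand-in for KernelRunner: records commands, returns fake data."""
+
+    def __init__(self):
+        self.seen = []
+
+    def switch_to_simpy(self, cmd):
+        self.seen.append(cmd)
+        return f"data-for-{cmd[0]}"
+```
+
+With `MiniContext()` a `load` call only appends to `_commands`; with
+`MiniContext(runner=EchoRunner(), dispatch_cycles=0)` the same call
+returns the runner's reply and leaves `_commands` empty.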
+
+### D3. Primitive categories
+
+#### D3.1. Reference (no DMA, metadata only)
+
+- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: create a handle
+  referencing HBM data without issuing DMA. Used when the scheduler
+  streams the data per-tile (e.g., the b operand of a composite GEMM).
+
+#### D3.2. Data movement (blocking, DMA engine)
+
+- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → handle.
+  Emits `DmaReadCmd`. In greenlet mode the returned handle's `.data`
+  carries real numpy data; in command-list mode it is a placeholder.
+  The handle has `space="hbm"`, `pinned=True`.
+- `tl.store(ptr, handle) -> None`: TCM → HBM. Emits `DmaWriteCmd`. In
+  greenlet mode, when `handle.data` is present, `_store.write("hbm",
+  ptr, data)` runs first (visibility = issue time, ADR-0020 D3).
+
+#### D3.3. GEMM / compute (blocking)
+
+- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must live in
+  TCM; shapes `(M,K) × (K,N) → (M,N)`. Emits `GemmCmd`; the output
+  handle is allocated from PE-local scratch via
+  `_make_compute_out(shape, dtype)`.
+- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
+  acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking
+  tiled pipeline. Emits `CompositeCmd`. `epilogue` is a list of dicts,
+  each with `"op"` plus op-specific fields and an optional `"scope"`
+  (k_tile / output_tile). Unknown ops or missing fields raise
+  ValueError immediately. The returned CompletionHandle synchronizes
+  via `tl.wait(h)`.
+
+#### D3.4. Math: unary (blocking)
+
+- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
+  `tl.cos(x)`, `tl.sin(x)` — each emits `MathCmd(op=<name>,
+  inputs=(x,), out=)`. `out` is scratch-allocated with the same
+  shape/dtype as `x`.
+
+#### D3.5. Math: binary (blocking)
+
+- `tl.maximum(a, b)`, `tl.minimum(a, b)` — `_binary_math`.
+- `tl.fma(a, b, c)` — `a*b + c`. Three inputs.
+- `tl.clamp(x, min, max)` — `MathCmd(op="clamp", inputs=(x, min, max))`.
+- `tl.where(cond, a, b)` — `MathCmd(op="where", inputs=(cond, a, b))`.
+- `tl.softmax(x, axis=-1)` — a single `MathCmd(op="softmax")` so timing
+  is accounted as a single dispatch. Phase 2 DataExecutor expands it to the
+  canonical (x-max → exp → sum → div) sequence.
+
+#### D3.6. Reduction (blocking)
+
+- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)` — return an
+  output handle with the axis size collapsed to 1. Emit
+  `MathCmd(op=<name>, inputs=(x,), out=, axis=axis)`.
+
+#### D3.7. Index / scalar (PE_CPU, no engine)
+
+- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE
+  index), `axis==1` → cube_id (ADR-0022).
+- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per
+  cube), `axis==1` → num_cubes.
+- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range
+  in TCM. No command emitted.
+- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape,
+  value, dtype="f16") -> TensorHandle`: TCM placeholder. No command
+  emitted.
+
+#### D3.8. Scalar helpers (no command, no engine)
+
+- `TLContext.cdiv(a, b) -> int` (static): ceiling division
+  `-(-a // b)`. Mirrors real Triton's `tl.cdiv`.
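+
+As a quick check of the `-(-a // b)` formula, here is a free-function
+version (the name `cdiv` as a module-level function is an assumption; the
+ADR exposes it as a static method):
+
+```python
+def cdiv(a, b):
+    """Ceiling division via floor division of the negated numerator."""
+    return -(-a // b)
+
+
+print(cdiv(8, 4))       # 2  (exact multiple)
+print(cdiv(10, 4))      # 3  (2.5 rounds up)
+print(cdiv(1000, 128))  # 8  (7.8125 rounds up)
+```
+
+Python's `//` floors toward negative infinity, so negating before and
+after the division turns the floor into a ceiling without using floats.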
+
+#### D3.9. Metadata-only (no compute, no DMA)
+
+- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims
+  swapped. Shares `addr` and `data`; no command emitted.
+
+#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
+
+- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
+  dtype="f16", space="tcm") -> None`: blocking send. Accepts either
+  handle form or raw-address form. Emits `IpcqSendCmd`. The handle's
+  `.data` snapshot rides along on the command — avoiding the race
+  where a later inbound IPCQ overwrites the slot before the outbound
+  PE_DMA reads it.
+- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
+  dst_space=None) -> TensorHandle`: blocking recv. Providing both
+  `dst_addr` and `dst_space` enters "copy_to_dst" mode; otherwise
+  "return_slot" mode. In greenlet mode the handle's `.data` carries
+  the real data.
+- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
+  **DIAGNOSTIC ONLY**. Has the same blocking-arrival semantics as
+  `tl.recv` but skips the slot-read latency charge (slot-IO + PE↔bank
+  fabric drain). Used in the pe2pe overview plot for an apples-to-apples
+  comparison against `tl.store`. Production kernels MUST NOT use it —
+  the diagnostic flag is isolated in its own command branch
+  (`consume=False`) so it cannot be accidentally enabled.
+- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
+  recv. Returns a `RecvFuture`; resolved later by `tl.wait(future)`.
+
+#### D3.11. Composite + control
+
+- `tl.composite(...)`: see D3.3.
+- `tl.wait(handle=None)`: wait on a `CompletionHandle` (composite), a
+  `RecvFuture` (async recv), or `None` (all pending composites).
+- `tl.cycles(n)`: declare a scalar PE_CPU overhead. Emits
+  `PeCpuOverheadCmd(cycles=n)`.
+
+### D4. TensorHandle arithmetic operators — thread-local TLContext
+
+At module load, `tl_context.py::_enable_tensor_ops()` runs and patches
+`TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__`. Each
+operator calls `_binary_math` on the active TLContext stored in a
+module-level thread-local `_ctx`.
+
+So inside a kernel, `c = a + b` is equivalent to emitting
+`MathCmd(op="add", inputs=(a, b), out=)` and returning a new
+TensorHandle.
+
+Active-TLContext management:
+
+- `TLContext._set_active(ctx)`: set the active ctx for the current
+  thread/greenlet.
+- `TLContext._get_active()`: read it (RuntimeError if unset).
+- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets active
+  on entry, runs the kernel, restores `None` on exit.
+
+`KernelRunner` re-asserts `_set_active(tl)` inside its `_switch_kernel`
+just before resuming the kernel, so a sibling PE runner that overwrote
+the thread-local context is correctly recovered.
+
+### D5. Scratch allocator — compute output handles
+
+Ops that produce a result — `tl.dot`, `tl.exp`, `tl.add` (via
+TensorHandle `__add__`), etc. — call `_make_compute_out(shape, dtype)`
+to obtain a 16-byte-aligned scratch address. The address is published
+with `space="tcm"`, so the handle can later be the source of a
+`tl.send` / `tl.store`.
+
+When `_scratch_base == 0` (e.g., command-list mode), the address is 0
+and the handle cannot be a send/store source (in that case, only
+`tl.load`-returned handles are valid sources).
+
+When the cursor exceeds `_scratch_size` (default 1 MiB), a
+RuntimeError is raised. The cursor must reset between kernel
+invocations (current code naturally satisfies this: KernelRunner
+creates a fresh TLContext each time).
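+
+The allocator behaviour described in D5 can be sketched as a bump
+allocator. This is a hedged illustration: `ScratchAllocator` is a
+hypothetical class, the sketch assumes the base address is itself
+16-byte aligned, and whether the real cursor advances when the base is 0
+is not stated here, so the sketch simply returns 0 in that case.
+
+```python
+class ScratchAllocator:
+    ALIGN = 16
+
+    def __init__(self, base, size=1 << 20):  # 1 MiB default, as in D5
+        self._base = base
+        self._size = size
+        self._cursor = 0
+
+    def alloc(self, nbytes):
+        if self._base == 0:
+            # No scratch region configured: the published address is 0.
+            return 0
+        # Round the cursor up to the next 16-byte boundary.
+        start = (self._cursor + self.ALIGN - 1) & ~(self.ALIGN - 1)
+        end = start + nbytes
+        if end > self._size:
+            raise RuntimeError("scratch region exhausted")
+        self._cursor = end
+        return self._base + start
+```
+
+For example, two 10-byte requests against base `0x4000` land at `0x4000`
+and `0x4010`: the second starts at the next 16-byte boundary, not at
+offset 10.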
+
+### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
+
+Every non-metadata `tl.*` call starts with `_emit_dispatch_overhead()`,
+which — when `dispatch_cycles > 0` — emits
+`PeCpuOverheadCmd(dispatch_cycles)`. This models the cycles PE_CPU
+spends dispatching the command.
+
+Defaults:
+
+- `TLContext.__init__`'s `dispatch_cycles` parameter default: `1` cycle.
+- TLContext built by `KernelRunner`: `0` cycles (greenlet mode handles
+  cycle accounting differently — aligned with ADR-0020 D3 intent).
+
+### D7. Kernel registry (`triton_emu/registry.py`)
+
+A separate `_kernels: dict[str, Callable]` holds the name → function
+mapping:
+
+- `register_kernel(name, fn)`: ValueError on duplicate.
+- `get_kernel(name)`: KeyError if missing.
+- `clear_registry()`: test-only.
+
+`RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites
+`_kernels[kernel_name] = kernel_fn` on every call (last-call-wins,
+idempotent) — consistent with ADR-0045 D8's `launch` behavior.
+
+PE_CPU looks up `KernelRef.name` in the registry and runs the function
+through KernelRunner.
+
+## Alternatives Considered
+
+### A1. Fold `tl.*` into ADR-0014 / ADR-0020
+
+Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption
+of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing /
+Phase 2 data). The `tl.*` surface is what the kernel author touches; a
+dedicated ADR improves findability and onboarding.
+
+### A2. Deprecate command-list mode
+
+Rejected (currently). Simple unit tests and kernel verification benefit
+from the lighter command-list path — it exposes a PeCommand sequence
+inspector without requiring greenlet machinery. When greenlet-mode
+semantics (real data, Phase 2) are needed, D2 explicitly selects them.
+
+### A3. Remove TensorHandle arithmetic operators
+
+Rejected. They mimic real Triton kernel ergonomics (e.g., `c = a + b`),
+and the thread-local active-ctx pattern works cleanly. The explicit
+function-form (`tl.add(a, b)`) is also exposed in D3.5, so the
+operators are syntactic sugar.
+
+### A4. Expand softmax into the explicit sequence (max → exp → sum → div)
+
+Partially adopted. `tl.softmax` is a single `MathCmd(op="softmax")` for
+timing accounting (D3.5), but Phase 2 DataExecutor expands it to the
+canonical sequence for real-data computation. The timing model stays
+atomic while the data model is expanded; the split is intentional.
+
+## Consequences
+
+- Every `tl.*` primitive a bench author meets is classified and defined
+  in a single ADR. Paired with ADR-0045 D8's host-side surface
+  (`torch.empty` etc.), the inside-kernel and outside-kernel authoring
+  guides are now complete.
+- The command-list / greenlet difference is pinned in D2, so any new
+  `tl.*` primitive that follows the `_emit()` pattern auto-supports
+  both modes.
+- The thread-local active-ctx pattern (D4) is justified at ADR level,
+  clarifying who owns the reset responsibility when multiple PE
+  runners share a thread (KernelRunner.run's contract restores active
+  inside `_switch_kernel`).
+- `tl.recv_no_consume`'s diagnostic isolation (D3.10) is hardened in
+  ADR form — accidental production use is blocked by a separate
+  command branch.
+- The registry (D7) gets its own D-section, formalizing the
+  name-collision and dynamic-re-registration semantics.
@@ -0,0 +1,259 @@
+# ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
+
+## Status
+
+Accepted (2026-05-22).
+
+Pins down what `runtime_api/distributed.py`'s `AhbmCCLBackend` +
+`DistributedContext` actually install — i.e., the entry point
+`torch.distributed.init_process_group(backend="ahbm")` — and how
+`all_reduce`/`barrier`/`get_rank` etc. are implemented. ADR-0023 D11
+mentions the "torch.distributed compatibility" intent, but **the backend
+itself** had no ADR-level coverage.
+
+## First action
+
+`RuntimeContext.__post_init__` automatically constructs a
+`DistributedContext()` and attaches it to `self.distributed`. The first
+action at that moment:
+
+1. `self._backend: AhbmCCLBackend | None = None` — uninitialized.
+2. `self._rank_by_greenlet: dict = {}` — greenlet-local rank registry
+   (ADR-0024 D2).
+3. The caller (RuntimeContext) sets `dc._ctx_ref = self` so subsequent
+   `init_process_group` can reach `ctx.engine` / `ctx.spec` / `ctx.launch`.
+
+In short, **DistributedContext's first act is "attach to RuntimeContext
+with a back-reference and leave the backend slot empty"**. Actual
+backend installation (IPCQ install, world_size derivation, algorithm
+module import) happens only when user code calls
+`torch.distributed.init_process_group(backend="ahbm")`.
+
+At that moment, `init_process_group`'s first action is:
+
+1. If `backend != "ahbm"`, raise `ValueError("Unsupported backend ...")`
+   immediately.
+2. If `getattr(self, "_ctx_ref", None)` is None,
+   `RuntimeError("DistributedContext not bound to a RuntimeContext")`.
+3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)` — inside this
+   constructor, ccl.yaml is loaded, the algorithm module is imported,
+   world_size is derived, SFR is configured, and IPCQ is installed.
+4. `self._backend._dist_ctx = self` — the backend gets a back-reference
+   so it can read `_rank_by_greenlet`.
+
+## Context
+
+The `AhbmCCLBackend` exists so that PyTorch DDP collective calls
+(`init_process_group`, `all_reduce`, etc.) work unchanged and bench code
+reads identically to a real DDP training script (in line with
+ADR-0024 + ADR-0027's launcher model).
+
+The backend's responsibilities:
+
+- At `init_process_group` time, install the **IPCQ neighbor table once**
+  (analogous to NCCL communicator creation).
+- For each `all_reduce(tensor, op="sum")`, dispatch the configured
+  algorithm's kernel function via `ctx.launch(...)`.
+- Answer `get_world_size` / `get_rank` consistently from the
+  greenlet-local rank registry plus ccl.yaml/topology.
+
+ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) touch
+parts of this, but **the backend's own responsibility scope and decision
+order** are not pinned anywhere. This ADR fills that gap.
+
+## Decision
+
+### D1. The backend is created only at `init_process_group(backend="ahbm")` time
+
+`DistributedContext` starts with `_backend = None`. The backend object
+does not exist until the user calls
+`dist.init_process_group(backend="ahbm")`. Any other API
+(`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) called
+while `_backend` is None raises
+`RuntimeError("Default process group has not been initialized...")` via
+the `_ensure_initialized` helper.
+
+`backend != "ahbm"` raises `ValueError` immediately. Other backend names
+(`nccl`, `gloo`, etc.) are not recognized.
+
+### D2. world_size resolution priority — algorithm > defaults > topology
+
+`AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):
+
+1. If `ccl.yaml`'s algorithm entry has `world_size`, use it.
+2. Else if `defaults.world_size` is set, use it.
+3. Else fall back to `spec.system.sips.count` (the topology's SIP count).
+
+The default interpretation is **rank = SIP** (ADR-0024). Cube/PE-level
+parallelism is expressed inside each rank via DPPolicy and does not
+affect world_size. An explicit `ccl.yaml` override is preserved for the
+legacy "rank = flat PE index" test path.
+
+User arguments to `init_process_group(world_size=..., rank=...)` are
+**accepted but ignored** (same as real PyTorch's `RANK` / `WORLD_SIZE`
+env vars).
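+
+The priority chain in D2 can be sketched as a three-way fallback. This is
+an illustration only: `resolve_world_size` as a free function and the
+plain nested-dict shapes for the algorithm entry, the defaults block, and
+the topology spec are assumptions, not the signature of
+`AhbmCCLBackend._resolve_world_size`.
+
+```python
+def resolve_world_size(algo_cfg, defaults, spec):
+    """algorithm entry > ccl.yaml defaults > topology SIP count."""
+    if algo_cfg.get("world_size") is not None:
+        return int(algo_cfg["world_size"])
+    if defaults.get("world_size") is not None:
+        return int(defaults["world_size"])
+    # Default interpretation: one rank per SIP.
+    return int(spec["system"]["sips"]["count"])
+
+
+spec = {"system": {"sips": {"count": 4}}}
+print(resolve_world_size({}, {}, spec))                   # 4
+print(resolve_world_size({}, {"world_size": 8}, spec))    # 8
+print(resolve_world_size({"world_size": 2},
+                         {"world_size": 8}, spec))        # 2
+```
+
+Note that nothing in this chain reads the `world_size` or `rank`
+arguments passed to `init_process_group`, matching the
+"accepted but ignored" rule above.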
+
+### D3. `init_process_group` performs five installation steps
+
+Inside `AhbmCCLBackend.__init__`, in order:
+
+1. **Load ccl.yaml**: `kernbench.ccl.install.load_ccl_config()` →
+   `resolve_algorithm_config(_cfg_all)` produces the merged config for
+   `defaults.algorithm` (or the user-specified algorithm).
+2. **Import algorithm module**:
+   `importlib.import_module(self._merged["module"])`. The module must
+   expose a `kernel` function, a `kernel_args(world_size, n_elem,
+   cube_w, cube_h)` helper, and optionally a `TOPO_NAME_TO_KIND` map.
+3. **Resolve world_size** (D2).
+4. **Collect topology metadata** from `spec`: `n_sips`, `sip_topo`
+   (`ring_1d` default), `cube_w`/`cube_h`, `sips.w`/`sips.h`. When the
+   SIP topology is not `ring_1d`, derive `_sip_topo_w/h` from explicit
+   `w`/`h` or via square-root (require `w*h == n_sips`). Mismatch raises
+   `ValueError`.
+5. **Install SFR + IPCQ**:
+   `kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine,
+   spec, self._merged)`. This pushes IPCQ neighbor tables to every
+   SIP/cube's pe0 (one-time setup analogous to NCCL communicator
+   creation).
+
+If the order changes (e.g., SFR runs before the algorithm module
+loads), partial initialization can result. D3 is therefore treated as
+an atomic five-step block: on failure the backend remains uninstalled.
+
+### D4. Greenlet-local rank binding (ADR-0024 D2)
+
+`DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps spawned
+worker greenlets to their ranks. When the bench launcher (e.g.,
+`torch.multiprocessing.spawn`) spawns a worker, it registers via
+`dc._bind_rank(g, rank)`.
+
+`get_rank()` looks up `getcurrent()`'s greenlet. Unregistered greenlets
+fall back to 0 — preserves single-driver / test compatibility.
+
+The backend reads the current greenlet's rank from
+`_dist_ctx._rank_by_greenlet` during `all_reduce` (D5).
+
+### D5. `all_reduce(tensor, op="sum")` behavior
+
+Validation:
+
+- `op != "sum"` → `NotImplementedError`. Current kernels only
+  implement add reduction.
+- `tensor._handle is None` → `RuntimeError("not deployed")`.
+- `tensor._handle.shards` empty → `RuntimeError("no shards")`.
+
+Preparation:
+
+- `n_elem = shards[0].nbytes // tensor.itemsize` — element count of a
+  single shard.
+- `kernel_fn = self._algo_module.kernel` — the algorithm module's entry
+  function (imported in D3).
+- Decide effective cube dims: if the first SIP has just 1 cube, use
+  `(1, 1)`; otherwise use the topology's `cube_w`/`cube_h`. This
+  naturally absorbs TP runs that use only a subset of cubes.
+- `kernel_args = self._algo_module.kernel_args(world_size, n_elem,
+  cube_w, cube_h)` — the algorithm decides which arguments to pass to
+  its kernel.
+
+Dispatch:
+
+- Resolve the current greenlet's rank via
+  `_rank_by_greenlet.get(g, 0)`.
+- Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w,
+  sip_topo_h)`.
+- `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor,
+  *kernel_args, *extra_args, _defer_wait=True)` — `_defer_wait=True`
+  delegates collective drain to the main scheduler (ADR-0027 D0.4).
+
+Drain:
+
+- If the parent greenlet is alive (multi-greenlet mode), enqueue
+  `_pending_collective_handles` and switch to parent. The main
+  scheduler drains after all ranks have launched.
+- If single-driver mode, drain inline:
+  `for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
+
+### D6. `barrier()` is a no-op (single-driver model)
+
+kernbench runs all ranks as greenlets inside a single Python process,
+so no cross-process synchronization is needed. `barrier()` is callable
+but does no synchronization. Kept for real-PyTorch API compatibility so
+callers don't get `NotImplementedError`.
+
+If multi-process kernbench (SimPy event loop per process) is introduced
+in the future, D6 needs a superseding ADR.
+
+### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`
+
+- `get_rank()` (D4): the current greenlet's bound rank; unregistered → 0.
+- `get_world_size()` (D2): the world_size resolved by the backend in D3.
+- `get_backend()`: always the literal string `"ahbm"`. Calling it before
+  the backend exists triggers `_ensure_initialized`'s `RuntimeError`.
+
+Differences vs. real PyTorch:
+
+- Real PyTorch `get_rank()` is a process-global value; here it is
+  greenlet-local. Inside a spawned worker → the worker's rank; in the
+  main (unbound) greenlet → 0. Bench authors should expect meaningful
+  ranks only inside worker functions.
+
+### D8. Supported API surface (final)
+
+`DistributedContext` exposes:
+
+- `init_process_group(backend="ahbm", world_size=None, rank=None,
+  **kwargs)`
+- `is_initialized() -> bool`
+- `get_world_size() -> int`
+- `get_rank() -> int`
+- `get_backend() -> str`
+- `all_reduce(tensor, op="sum") -> None`
+- `barrier() -> None`
+- (internal) `_bind_rank(g, rank)`
+
+Other PyTorch distributed APIs (`broadcast`, `reduce`, `all_gather`,
+`gather`, `scatter`, point-to-point `send/recv`, etc.) are **not
+implemented**. Kernel-level expression is available via
+`tl.send`/`tl.recv` (ADR-0046 D3.10), but the `dist.*` surface does not
+expose them. If additional collectives are needed, add a paired
+(algorithm module, `DistributedContext` method) and extend D8.
+
+## Alternatives Considered
+
+### A1. Create the backend in `RuntimeContext.__init__`
+
+Rejected. If `ccl.yaml` is missing or the algorithm module can't be
+imported, RuntimeContext construction would fail even when the bench
+does not use distributed features. Lazy creation at call time (D1) is
+the right semantics.
+
+### A2. Always derive world_size from topology (no override)
+
+Rejected. ADR-0024 D1's "explicit override" path is used by legacy
+tests. Diagnostic scenarios that define PE-level ranks within a single
+SIP also need this escape hatch.
+
+### A3. Silent fallback for unsupported `op`
+
+Rejected. If the user intends `op="prod"` / `"max"` / `"avg"` and `sum`
+silently runs instead, result validation gets very hard. An explicit
+`NotImplementedError` is safer.
+
+### A4. Implement `barrier` as a SimPy event
+
+Rejected (currently). With single-driver semantics there is no
+cross-process synchronization to express, so a no-op is meaningfully
+correct. A fake-barrier SimPy event would add code complexity for no
+semantic gain. Revisit when multi-process kernbench arrives.
+
+## Consequences
+
+- The five-step installation (D3) for
+  `torch.distributed.init_process_group(backend="ahbm")` is locked in,
+  making clear where future collective algorithms must hook.
+- The priority order in D2 (algorithm > defaults > topology) makes the
+  blast radius of ccl.yaml changes easy to assess.
+- The no-op `barrier` (D6) is recorded so multi-process kernbench, if
+  introduced, must explicitly supersede this ADR.
+- D8's list of unsupported APIs explicitly grounds the rejection
+  message when users call, e.g., `dist.broadcast(...)`.
@@ -0,0 +1,278 @@
+# ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
+
+## Status
+
+Accepted (2026-05-22).
+
+Pins down the free-list algorithm, page alignment, and coalescing rules
+used by `policy/address/allocator.py`'s `_FreeList` / `PEMemAllocator`
+and `va_allocator.py`'s `VirtualAllocator`. ADR-0001 (PhysAddr layout)
+and ADR-0011 (PA/VA/LA models) define the address schemes; the
+**allocation algorithms** had no ADR-level coverage.
+
+## First action
+
+### `_FreeList(capacity)`
+
+On construction: `self._capacity = capacity`, `self._used = 0`,
+`self._free = [(0, capacity)]`. The first act is **establishing the
+entire region as one free block** — the tuple `(offset=0,
+size=capacity)` is the sole entry in the free list.
+
+### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
+
+On construction, builds two `_FreeList`s:
+
+- `self._hbm = _FreeList(cfg.hbm_slice_bytes)` — the size of this PE's
+  HBM slice (`hbm_bytes_per_cube // hbm_slices_per_cube`).
+- `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)` — equals
+  `tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler
+  reservation is pre-deducted).
+
+So PEMemAllocator's first act is **constructing single-free-block
+HBM-slice and TCM regions for this PE**.
+
+### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
+
+On construction: `self._va_base = va_base`, `self._va_size = va_size`,
+`self._page_size = page_size`, `self._used = 0`, `self._free =
+[(va_base, va_size)]`. The first act is **establishing one free block
+that starts at va_base and spans va_size bytes, and stashing
+page_size**.
+
+## Context
+
+`runtime_api/context.py::_ensure_allocators` builds the allocator set
+in these stages:
+
+1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`,
+   per-target_device SIP range, etc. from `spec`.
+2. Pack everything into a frozen `AddressConfig`.
+3. For every combination in the target SIP range × cubes × PEs,
+   construct one `PEMemAllocator(sip, cube, pe, cfg)` instance.
+4. Construct one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64
+   GiB, page_size=pe_mmu.page_size)`.
+
+Allocator responsibilities:
+
+- **PEMemAllocator**: PA-space allocation in the PE-local HBM slice /
+  TCM (including PhysAddr encoding).
+- **VirtualAllocator**: device-wide VA allocation, page-aligned.
+  `RuntimeContext._create_tensor` then pushes VA → PA mappings to
+  components via `MmuMapMsg`.
+
+These algorithms are:
+
+- **First-fit**, kept simple.
+- The free-block list is **sorted by start offset**.
+- On `free()`, **adjacent blocks coalesce**.
+
+The rationale was not documented anywhere, so questions such as "why
+not best-fit?", "why not a buddy allocator?", or "why does
+partial-overlap free pass silently?" had no anchor to answer from. This
+ADR provides it.
+
+## Decision
+
+### D1. `_FreeList` — offset-keyed first-fit + coalescing
+
+`policy/address/allocator.py::_FreeList`:
+
+- Internal representation: `list[tuple[int, int]] = [(start_offset,
+  size), ...]` — sorted by start offset.
+- `alloc(nbytes)`:
+  1. Iterate the free list from the front (first-fit).
+  2. Take from the first block with `size >= nbytes`.
+  3. Exact match → drop the block; otherwise shrink it to `(start +
+     nbytes, size - nbytes)`.
+  4. `_used += nbytes`; return the taken `start`.
+  5. If no block fits, `AllocationError("overflow ... largest free
+     block ...")`.
+- `free(offset, nbytes)`:
+  1. `_used -= nbytes`.
+  2. `bisect_left(self._free, (offset,))` finds the insertion index.
+  3. If adjacent to the previous block (`prev_start + prev_size ==
+     offset`), merge.
+  4. If adjacent to the next block (`offset + nbytes == next_start`),
+     merge.
+  5. Insert the coalesced range at the right sorted position.
+
+This algorithm is weaker than best-fit / buddy on fragmentation, but
+the simulator's workload (mostly stack-like deploy/free) tolerates it.
+If the workload shape changes, D1 is a supersession candidate.
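+
+A minimal sketch of the D1 algorithm, built only from the steps listed
+above. `FreeListSketch` is a hypothetical stand-in for `_FreeList`, and
+`MemoryError` stands in for `AllocationError`; as in the real allocator,
+the freed range is trusted and not validated.
+
+```python
+from bisect import bisect_left
+
+
+class FreeListSketch:
+    """Illustrative offset-sorted first-fit free list (not the real _FreeList)."""
+
+    def __init__(self, capacity: int) -> None:
+        self.used = 0
+        self.free_blocks = [(0, capacity)]  # (start_offset, size), sorted
+
+    def alloc(self, nbytes: int) -> int:
+        for i, (start, size) in enumerate(self.free_blocks):
+            if size >= nbytes:  # first fit: take the first block that is big enough
+                if size == nbytes:
+                    del self.free_blocks[i]
+                else:
+                    self.free_blocks[i] = (start + nbytes, size - nbytes)
+                self.used += nbytes
+                return start
+        largest = max((s for _, s in self.free_blocks), default=0)
+        raise MemoryError(f"overflow: need {nbytes}, largest free block {largest}")
+
+    def free(self, offset: int, nbytes: int) -> None:
+        self.used -= nbytes
+        i = bisect_left(self.free_blocks, (offset,))
+        start, size = offset, nbytes
+        # Merge with the previous block when it ends exactly at `offset`.
+        if i > 0:
+            prev_start, prev_size = self.free_blocks[i - 1]
+            if prev_start + prev_size == offset:
+                i -= 1
+                self.free_blocks.pop(i)
+                start, size = prev_start, prev_size + size
+        # Merge with the next block when it starts exactly at our end.
+        if i < len(self.free_blocks) and self.free_blocks[i][0] == start + size:
+            size += self.free_blocks.pop(i)[1]
+        self.free_blocks.insert(i, (start, size))
+
+
+fl = FreeListSketch(100)
+a, b, c = fl.alloc(10), fl.alloc(20), fl.alloc(30)
+fl.free(a, 10)
+fl.free(c, 30)
+fl.free(b, 20)  # bridges both neighbours, so everything coalesces
+assert fl.free_blocks == [(0, 100)] and fl.used == 0
+```
+
+Freeing `b` last exercises both merge branches at once, which is the
+case the sorted-by-offset invariant exists to make cheap.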
+
+### D2. Partial-overlap free is **not** validated
+
+`_FreeList.free(offset, nbytes)` trusts the caller to pass the exact
+`(offset, nbytes)`. It does **not** verify:
+
+- That the range was actually allocated.
+- That the range does not overlap another allocated region.
+
+Reason: in a simulator context, callers always store the return value
+of `alloc()` and pass it back to `free()` — there is no external user
+input. Adding a safety check would cost O(N) per free and impact
+simulation wall-clock.
+
+If this trust model breaks (e.g., a code path lets two tensors point
+at the same PA), this ADR must be revisited.
+
+### D3. `PEMemAllocator` — two channels for HBM/TCM
+
+`PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList`s:
+
+- `_hbm`: size `cfg.hbm_slice_bytes`.
+- `_tcm`: size `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
+  tcm_scheduler_reserved_bytes`).
+
+`alloc_hbm(nbytes) -> PhysAddr`:
+
+- `_hbm.alloc(nbytes)` → offset.
+- `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id,
+  pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`.
+- Failure raises `AllocationError("HBM overflow ...")`.
+
+`free_hbm(pa, nbytes)`:
+
+- Recover PE-local offset via `pa.hbm_offset - pe_id *
+  cfg.hbm_slice_bytes`.
+- `_hbm.free(offset, nbytes)`.
+
+`alloc_tcm(nbytes) -> PhysAddr`: similar; uses `PhysAddr.pe_tcm_addr`.
+
+`free_tcm(pa, nbytes)`: uses `pa.sub_offset` directly (TCM's PE-local
+offset equals its sub_offset).
+
+The allocator does not see the scheduler-reserved TCM region
+(`cfg.tcm_scheduler_reserved_bytes`) — it is pre-subtracted from the
+`_tcm` capacity. This is consistent with ADR-0014's PE_SCHEDULER
+internal-buffer reservation.
+
+### D4. `VirtualAllocator` — page-aligned first-fit + coalescing
+
+`policy/address/va_allocator.py::VirtualAllocator`:
+
+- Internal representation: same sorted `list[tuple[int, int]]` as
+  `_FreeList`. Initially `[(va_base, va_size)]`.
+- `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
+- `alloc(nbytes) -> int`:
+  1. `aligned = _align_up(nbytes)`.
+  2. First-fit a block with `size >= aligned`.
+  3. Take `aligned` from the block's front; remove if exact.
+  4. `_used += aligned`. Return the block's `start` (which is page-
+     aligned).
+  5. Failure → `VaAllocationError`.
+- `free(va, nbytes)`: free `_align_up(nbytes)` worth. Coalesces with
+  the same algorithm as `_FreeList`.
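+
+The `_align_up` rounding can be written with integer arithmetic only,
+which avoids float rounding for large sizes. This is a sketch of the
+formula above, not necessarily how `VirtualAllocator` implements it;
+`align_up` and `PAGE_2M` are names chosen for the example.
+
+```python
+def align_up(nbytes: int, page_size: int) -> int:
+    """Round nbytes up to the next multiple of page_size using integer math."""
+    return -(-nbytes // page_size) * page_size
+
+
+PAGE_2M = 2 * 1024 * 1024
+assert align_up(1, PAGE_2M) == PAGE_2M
+assert align_up(PAGE_2M, PAGE_2M) == PAGE_2M
+assert align_up(PAGE_2M + 1, PAGE_2M) == 2 * PAGE_2M
+assert align_up(5000, 4096) == 8192
+```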
+
+`page_size` has different defaults in two places:
+
+- `VirtualAllocator.__init__`'s parameter default: `2 MiB`. Direct-call
+  tests receive this.
+- `RuntimeContext._ensure_allocators` when constructing the instance:
+  `pe_mmu.attrs.get("page_size", 4096)` — uses
+  `topology.yaml`'s `pe_mmu.attrs.page_size` if set, else falls back
+  to `4 KiB`.
+
+The two defaults differ on purpose: `VirtualAllocator`'s standalone
+default (`2 MiB`) aligns with ADR-0039's PE_MMU stopgap default for
+direct-test ergonomics; the context fallback (`4 KiB`) is the safe
+minimum when `topology.yaml` doesn't specify a page size. The
+production path is always the latter (via `_ensure_allocators`), and
+when `topology.yaml` sets `page_size`, that value flows consistently
+into both the MMU and the VA allocator.
+
+If consistency breaks (e.g., VirtualAllocator instantiated with a
+page_size different from PE_MMU's), MMU `map()` falls into the
+sub-page region mode (ADR-0039 D3).
+
+VA range defaults: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64
+GiB`. These are hardcoded in `_ensure_allocators` and have no
+semantic meaning in ADR-0011's VA model — they simply reserve enough
+device-wide space without colliding with host code.
+
+### D5. Lifecycle of allocator instances
+
+- `RuntimeContext._ensure_allocators` is lazy — called on the first
+  `_create_tensor`.
+- The allocator dict (`self._allocators`) lives for the
+  RuntimeContext's lifetime. A second deploy in the same process
+  does not construct new objects.
+- `RuntimeContext.cleanup()` walks living tensors and calls
+  `_free_tensor()`, which issues MMU unmaps + `va_allocator.free` +
+  `pemem_allocator.free_hbm` — restoring the free lists. A subsequent
+  RuntimeContext starts fresh.
+
+This per-RuntimeContext isolation guarantees deterministic deploy →
+cleanup → deploy sequences within a single process.
+
+### D6. Allocator failure raises (no silent OOM)
+
+`_FreeList.alloc` and `VirtualAllocator.alloc` raise `AllocationError`
+and `VaAllocationError` respectively when no block fits. The message
+includes the required size and the largest available block, which
+distinguishes fragmentation from true OOM.
+
+A silent fallback (e.g., allocating only as much as the largest free
+block) is strictly forbidden — a partially-allocated tensor reaching
+SimPy would cause routing / DMA to see incorrect PAs and silently
+corrupt simulation results.
+
+### D7. One allocator per address space
+
+Physical address spaces are separated by PhysAddr sub-units (ADR-0001
+D2.3); each sub-unit gets its own allocator instance:
+
+- HBM slice → `PEMemAllocator._hbm`.
+- PE TCM → `PEMemAllocator._tcm`.
+- (Currently unused) M_CPU local memory, CUBE SRAM → would need their
+  own allocators. Today these are handled as IPCQ-only slots (ADR-0023
+  D9.7) and do not share PA space, so no free-list exists for them.
+
+When a cube-level SRAM allocator is needed,
+`_FreeList(cfg.sram_bytes_per_cube)` is added per-cube
+(`cfg.sram_bytes_per_cube` is already defined in `AddressConfig` —
+the data model is ready).
+
+## Alternatives Considered
+
+### A1. Best-fit / buddy allocator
+
+Rejected (currently). The workload's alloc/free pattern is stack-like
+(deploy order ≈ free order), so first-fit + coalescing controls
+fragmentation well enough. If long-running fragmentation appears in LLM
+kernel sweeps, a buddy-allocator ADR will replace D1.
+
+### A2. Add partial-overlap free validation
+
+Rejected. D2's trust model plus the O(N) per-free cost makes this
+unattractive. A debug mode (e.g., `KERNBENCH_DEBUG` env var) that
+enables the check could be added later.
+
+### A3. A unified allocator for VA and PA
+
+Rejected. VA space (64 GiB device-wide) and PA space (per-slice ~6
+GiB) have different semantic dimensions — VA is the kernel's view, PA
+is the device sub-unit's view. ADR-0011's VA model (MMU maps between
+the two) calls for separated allocators.
+
+### A4. Multi-tier page sizes (large pages + small pages)
+
+Rejected (currently). A single page size (2 MiB) matches LLM kernel
+tensor sizes (a few MiB to GiB); smaller mappings are absorbed by
+ADR-0039 D3's sub-page region mode. Multi-tier paging would require
+extending the MMU model itself — a separate ADR candidate.
+
+## Consequences
+
+- The allocator algorithm is pinned at ADR level (D1, D3, D4), so any
+  future simulation scenario hitting fragmentation has a clear "we're
+  using first-fit + coalescing" anchor to inspect.
+- D2's trust model is explicit, so any future code path that exposes
+  alloc/free to direct user input will trigger this ADR's supersession
+  early.
+- D7's one-allocator-per-sub-unit mapping is on record, so when M_CPU
+  or SRAM need their own free-list, the addition point is obvious.
+- D4 captures the page_size dual-default and its production path
+  (`_ensure_allocators` always wins), letting future `topology.yaml`
+  `page_size` changes be assessed against ADR-0039's stopgap
+  interaction quickly.
@@ -0,0 +1,247 @@
+# ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
+
+## Status
+
+Accepted (2026-05-22).
+
+Pins down the traffic-pattern catalog, formula-vs-actual comparison, and
+invariant checks (monotonicity, D2H ≥ H2D, etc.) exposed by
+`probes/probe.py::run_probe(...)`. ADR-0010 (CLI surface) enumerates the
+`kernbench probe` subcommand, but **what probe actually measures** and
+**which invariants it judges PASS/FAIL** had no ADR-level coverage.
+
+## First action
+
+`run_probe(topology_path, case_filter=None)` performs four startup steps:
+
+1. `Path(topology_path).expanduser().resolve()` → absolute path.
+2. `load_topology(path)` → `TopologyGraph` (graph + spec).
+3. `_build_edge_map(graph)` → a `{(src, dst): Edge}` lookup table.
+4. Instantiate `AddressResolver(graph)` + `PathRouter(graph)`.
+
+Then it sets `nbytes = 32768` (= 32 KiB, the summary-table reference
+size) and `show_all = (case_filter is None or case_filter == "all")`.
+
+In short, **probe's first act is "load the topology once and prepare
+edge map / resolver / router, plus pin 32 KiB as the standard measurement
+size"**. After that, the H2D → D2H → PE DMA categories execute in
+separate `GraphEngine` instances (no cross-talk between cases).
+
+## Context
+
+`kernbench probe` was introduced as a verification tool for these
+purposes:
+
+- **Manual ground truth**: when a real-simulation result (`kernbench run
+  --bench ...`) shows abnormal latency, derive the answer for a simple
+  traffic pattern in isolation and compare.
+- **Formula vs actual**: check whether the analytical model
+  (wire latency + overhead + drain) matches the simulator's
+  `total_ns`. A mismatch points to which simplifying assumption in
+  ADR-0033 is missing.
+- **Monotonicity check**: latency should grow monotonically with hop
+  count.
+- **Utilization sweep**: a BW-utilization table across data sizes
+  (4 KiB to 1 MiB).
+
+Without an ADR for this tool:
+
+- Adding a new traffic-pattern category (e.g., MCpuDma, IPCQ) is hard
+  because the table format / measurement units of existing categories
+  aren't documented at the ADR level.
+- The basis for the monotonicity check (hop count? cube distance? wire
+  length?) is ambiguous.
+- The reference size 32 KiB and the sweep `[4 KiB, 16 KiB, 64 KiB, 256
+  KiB, 1 MiB]` are only discoverable by reading source.
+
+## Decision
+
+### D1. Three case categories — H2D / D2H / PE DMA
+
+Each category has a distinct data path in the topology and gets its own
+summary table + sweep table + route-detail block.
+
+- **H2D (Host → Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
+  dst_pe=0, pattern="zero")` flows along `pcie_ep → io_cpu → m_cpu →
+  hbm_ctrl`. The cube index varies the hop count:
+  - h2d-1hop: cube=0, hops=1
+  - h2d-2hop: cube=4, hops=2
+  - h2d-3hop: cube=8, hops=3
+  - h2d-4hop: cube=12, hops=4
+- **D2H (Device → Host Read)**: `MemoryReadMsg(src_sip=0, src_cube,
+  src_pe=0)`. Total latency = forward command path + reverse data path.
+  Uses the same four hop-count cases as H2D.
+- **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe,
+  dst_pa)`. Five cases cover varying cube/PE positions:
+  - pe-local-hbm: same cube, same PE
+  - pe-same-half-hbm: same cube, different PE (PE 1)
+  - pe-cross-half-hbm: same cube, far PE (PE 4)
+  - pe-cross-cube-hbm-best: adjacent cube (cube 1)
+  - pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
+
+The cube indices 4/8/12 (H2D) and 1/4/15 (PE DMA) are meaningful for a
+4 × 4 cube mesh (`sip.cube_mesh.w=4, h=4`); changes to the mesh size
+require these to be updated in lockstep.
+
+### D2. Standard measurement size — `nbytes = 32768` (32 KiB)
+
+Every case in the summary table runs once with `nbytes=32768`. 32 KiB
+was chosen because:
+
+- DMA overhead and BW drain are balanced — neither dominates.
+- It compares cleanly against the one-shot transfer size of several
+  sub-units (TCM, register file).
+
+Per-size utilization variations are shown in a separate sweep table
+(D3).
+
+### D3. Utilization sweep — `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
+
+`SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`,
+`SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]`. Per size:
+
+```
+drain   = nbytes / bottleneck_bw
+total   = overhead + wire + drain
+eff_bw  = nbytes / total
+util%   = eff_bw / bottleneck_bw × 100
+```
+
+When `bn_bw` is `None` or `<= 0`, the column shows 0.0 %. The intent: the
+table shows in one view how small transfers become overhead-bound and
+large transfers become drain-bound as hop count rises.
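+
+The formulas above can be run for the whole sweep in a few lines. This
+is a sketch under stated assumptions: `util_percent` is a hypothetical
+helper, the 500 ns overhead, 10 ns wire, and 64 GB/s bottleneck are
+made-up inputs, and bandwidth is taken in bytes per ns (numerically
+equal to GB/s).
+
+```python
+SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]
+SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]
+
+
+def util_percent(nbytes, overhead_ns, wire_ns, bottleneck_bw):
+    """Utilization per the D3 formulas; bottleneck_bw is in bytes per ns."""
+    if bottleneck_bw is None or bottleneck_bw <= 0:
+        return 0.0  # missing or invalid bottleneck BW shows as 0.0 %
+    drain_ns = nbytes / bottleneck_bw
+    total_ns = overhead_ns + wire_ns + drain_ns
+    eff_bw = nbytes / total_ns
+    return eff_bw / bottleneck_bw * 100.0
+
+
+# Hypothetical path: 500 ns overhead, 10 ns wire, 64 bytes/ns bottleneck.
+for label, size in zip(SWEEP_LABELS, SWEEP_SIZES):
+    print(f"{label:>6}: {util_percent(size, 500.0, 10.0, 64.0):5.1f} %")
+```
+
+Because `util%` reduces to `drain / total`, the fixed overhead and wire
+terms dominate at 4 KiB and shrink toward irrelevance at 1 MiB.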
+
+### D4. Measured columns — actual / formula / breakdown
+
+Per-case columns:
+
+- `Actual` (total_ns): the SimPy run's `trace["total_ns"]`.
+- `Ovhd`: sum of `node.attrs["overhead_ns"]` along the path (formula).
+- `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
+- `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
+- `Ovhd%` / `Drain%`: each portion as a percentage of Actual. Wire is
+  usually too small to display.
+- `Eff.BW`: `nbytes / total_ns` (measured BW).
+- `BN.BW`: bottleneck bandwidth (formula). The minimum edge BW along
+  the path. Missing edge BW shows "-".
+- `Util%`: `Eff.BW / BN.BW × 100`. 100 % means the single-stream BW
+  upper bound is reached.
+
+A large gap between the formula sum (`wire + ovhd + drain`) and Actual
+signals a factor the simplified model misses (a place to inspect
+ADR-0033's assumptions).
+
+### D5. Automatic invariant checks — PASS/FAIL
+
+The following invariants are reported with `[v] PASS` / `[x] FAIL`:
+
+- **H2D / D2H monotonic increase**: as hop count rises, actual latency
+  must grow monotonically. `all(lats[i] < lats[i+1] for ...)`.
+- **D2H ≥ H2D**: for the same hop index, D2H ≥ H2D (D2H has both
+  forward command and reverse data legs). `all(d2h[i].total >=
+  h2d[i].total)`.
+- **PE DMA best < worst**: cross-cube best (adjacent) latency must be
+  less than cross-cube worst (diagonal).
+- **PE DMA local vs remote**: prints the local BN BW vs remote BN BW
+  side-by-side (informational, not PASS/FAIL).
+
+When a check fails, a single clear line surfaces the regression for
+human review.
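+
+The PASS/FAIL checks amount to a few `all(...)` comparisons. The sketch
+below is illustrative: the helper names and latency values are
+hypothetical, and only the `[v]` / `[x]` prefixes come from this ADR;
+the rest of the line format is an assumption.
+
+```python
+def strictly_increasing(latencies):
+    """True when each latency is strictly greater than the previous one."""
+    return all(a < b for a, b in zip(latencies, latencies[1:]))
+
+
+def d2h_not_faster(h2d, d2h):
+    """True when D2H >= H2D for every matching hop index."""
+    return all(d >= h for h, d in zip(h2d, d2h))
+
+
+def report(name, ok):
+    """Format one invariant line with the stable grep prefix."""
+    return f"{'[v] PASS' if ok else '[x] FAIL'} {name}"
+
+
+# Hypothetical latencies (ns) for the 1- to 4-hop cases.
+h2d = [700.0, 760.0, 820.0, 880.0]
+d2h = [900.0, 1010.0, 1120.0, 1230.0]
+
+print(report("H2D monotonic", strictly_increasing(h2d)))
+print(report("D2H monotonic", strictly_increasing(d2h)))
+print(report("D2H >= H2D", d2h_not_faster(h2d, d2h)))
+```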
+
+### D6. Route detail — per-hop timestamp trace
+
+After the summary and sweep tables, each case's path and cumulative
+per-hop timestamps (`_hop_timestamps`) appear in a separate section:
+
+- H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
+  (`m_cpu → hbm_ctrl`) + per-hop trace.
+- D2H: forward (cmd, no data) and reverse (data) traces shown
+  separately.
+- PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
+
+Each hop's timestamp is cumulative `wire_ns + overhead_ns`. The
+terminal hop's annotation appends `drain:Xns`. Bottleneck edges are
+marked `<BN:XXGB/s>` so they are visually identifiable.
+
+### D7. Semantics of the `case_filter` argument
+
+- `None` or `"all"`: run all cases (default).
+- Other strings: run only the case whose name matches exactly. Example:
+  `kernbench probe --case h2d-2hop`.
+
+Within a category, cases with `name != case_filter` are skipped; if
+only one data point remains, the category's monotonicity / D2H ≥ H2D
+comparisons are naturally skipped.
+
+The CLI parser's `--case` default is `"all"`, so omitting it runs
+everything.
+
+### D8. Fresh GraphEngine per case
+
+Each of the 4 H2D, 4 D2H, and 5 PE DMA cases runs in **its own
+GraphEngine** (`engine = GraphEngine(graph)`). Reasons:
+
+- Isolate accumulated state (op_log, completion tracking, allocators)
+  so cases do not cross-talk.
+- Guarantee one case's traffic does not perturb another case's BW
+  measurement.
+
+This isolation lets probe results be interpreted as **single-flow**
+per-case latency. Multi-flow contention measurement is handled by
+separate tooling (e.g., the `pe2pe_overview` plot or ADR-0033's
+multi-flow merging model).
+
+### D9. Output-format stability
+
+probe's stdout is meant for humans; precise column widths, separators,
+and whitespace are **not** a machine-readable contract. Automated tools
+that wish to parse probe output should use a separate JSON-output mode
+(not yet implemented).
+
+The `[v]` / `[x]` prefix on PASS/FAIL lines is a stable CI grep anchor.
+
+## Alternatives Considered
+
+### A1. Register probe as another bench (`@bench(name="probe")`)
+
+Rejected. probe is a verification tool, not a bench — multi-engine
+execution for sweeps/analysis and PASS/FAIL invariant output are
+essential, none of which fits ADR-0045's "single device + single
+RuntimeContext" bench model.
+
+### A2. Exit code 1 on monotonicity violation
+
+Rejected (currently). probe is positioned as a human inspection tool —
+PASS/FAIL is printed and exit is 0. A wrapper can `grep "\[x\]"` to
+decide. A future `--strict` flag could opt into non-zero exits.
+
+### A3. Externalize the case catalog to YAML
+
+Rejected (currently). The 13 cases (4 H2D + 4 D2H + 5 PE DMA) are
+hardcoded and their semantics are tightly bound to the mesh
+topology. Moving cube-index meaning (4, 8, 12 / 1, 4, 15) into YAML
+would require separate documentation and lose cohesion. Externalize
+only when case additions become frequent.
+
+### A4. Add multi-flow contention measurement
+
+Rejected (out of probe scope). D8's single-flow isolation is probe's
+core intent. Multi-flow contention belongs in a different area of the
+ADR-0033 latency model — either a separate tool or a new case
+category.
+
+## Consequences
+
+- probe's case catalog (D1) and measurement units (D2/D3) are pinned at
+  ADR level, so new traffic categories know which table format to
+  follow.
+- The semantics of the formula-vs-actual columns (D4) are locked in, so
+  questions like "why is Drain% 5 % or 70 %?" can quickly be linked to
+  ADR-0033 assumption checks.
+- Automatic invariant checks (D5) are pinned, so latency-model changes
+  immediately catch monotonicity / D2H ≥ H2D regressions.
+- D8's case-isolation is explicit, so probe results are safe to read as
+  single-flow measurements. If multi-flow is needed, a separate tool
+  track is clearly required.
+- A2's strict-mode flag is recorded as a follow-up so CI integration
+  has a minimal change path when requested.
@@ -0,0 +1,322 @@
+# ADR-0050: CCL Algorithm Module Contract — `ccl/algorithms/*.py`
+
+## Status
+
+Accepted (2026-05-22).
+
+Pins down the interface, kernel signature, and addition workflow that a
+module under `src/kernbench/ccl/algorithms/` must satisfy in order to be
+used as a collective algorithm by the AHBM CCL backend (ADR-0047).
+ADR-0047 D3 states only that "the algorithm module must expose `kernel`,
+`kernel_args`, optionally `TOPO_NAME_TO_KIND`"; **the contract an
+algorithm-module author needs to follow** has had no ADR-level coverage.
+This ADR pairs with ADR-0045's bench-module contract.
+
+## First action
+
+An algorithm module is imported at two moments:
+
+1. **AHBM backend entry**: when user code calls
+   `dist.init_process_group(backend="ahbm")`,
+   `AhbmCCLBackend.__init__` runs
+   `self._algo_module = importlib.import_module(self._merged["module"])`.
+   At module level, the following occur first:
+   - Topology-kind integer constants like `SIP_TOPO_RING/TORUS/MESH`
+     are bound in the module namespace.
+   - The `TOPO_NAME_TO_KIND` dict is bound; the backend reads it via
+     `getattr(self._algo_module, "TOPO_NAME_TO_KIND", None)`.
+   - The `kernel_args` function is defined for the caller.
+   - The actual algorithm function (e.g.,
+     `allreduce_intercube_multidevice`) is defined.
+   - At the bottom of the module, `kernel = allreduce_intercube_multidevice`
+     publishes the alias.
+
+2. **ccl.yaml install stage**:
+   `kernbench.ccl.install.install_ipcq` imports the same algorithm
+   module while pushing the IPCQ neighbor table.
+
+In short, **the algorithm module's first act is "publish topology-kind
+constants, the `TOPO_NAME_TO_KIND` dict, the `kernel_args` function, and
+the `kernel` alias into the module namespace"** — all as import-time
+side effects, no separate initialization call.
+
+## Context
+
+`AhbmCCLBackend` (ADR-0047), at process-group creation, dynamically
+imports a module path obtained from `ccl.yaml`'s `defaults.algorithm` (or
+a user-specified algorithm). The backend expects four things from the
+module:
+
+- `kernel`: the collective's entry function.
+- `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: a tuple
+  packing the kernel's positional arguments.
+- `TOPO_NAME_TO_KIND` (optional): a dict mapping `topology.yaml`'s
+  `sips.topology` string (e.g., `"ring_1d"`, `"torus_2d"`,
+  `"mesh_2d_no_wrap"`) to the integer kind constants.
+- (Indirectly) IPCQ neighbor-table install:
+  `configure_sfr_intercube_multisip` reads
+  the module's `TOPO_NAME_TO_KIND` plus cube dimensions to decide the
+  SFR.
+
+The current corpus has one algorithm module:
+`lrab_hierarchical_allreduce.py` (248 lines). The name expands to
+"**l**eft-**r**ight **a**lternating **b**roadcast hierarchical allreduce".
+When future modules like `ring_allreduce`, `tree_allreduce`, or
+`broadcast` are added, they must follow this contract for the backend's
+dispatch path to keep working.
+
+Without an ADR-level contract:
+
+- A new algorithm author has to infer the signature from ADR-0047 D3's
+  one-liner.
+- The kernel-function argument order (especially `t_ptr, n_elem,
+  cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w,
+  sip_topo_h, tl`) is unclear without grep.
+- It is conventional, but not documented, what `kernel_args` takes as
+  inputs and what tuple it must return.
+
+## Decision
+
+### D1. The algorithm module exposes four public symbols
+
+```python
+# src/kernbench/ccl/algorithms/<name>.py
+
+from __future__ import annotations
+
+# (required) topology-kind constants — referenced internally
+SIP_TOPO_RING = 0
+SIP_TOPO_TORUS = 1
+SIP_TOPO_MESH = 2
+
+# (optional) topology name → kind mapping. Used by the backend to
+# translate ccl.yaml/topology's string SIP topology into an integer.
+TOPO_NAME_TO_KIND = {
+    "ring_1d": SIP_TOPO_RING,
+    "torus_2d": SIP_TOPO_TORUS,
+    "mesh_2d_no_wrap": SIP_TOPO_MESH,
+}
+
+# (required) kernel argument builder
+def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
+    return (n_elem, cube_w, cube_h, world_size)
+
+# (required) kernel function (TLContext is injected via the `tl=...`
+# keyword argument).
+def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
+                         sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
+    ...
+
+# (required) kernel alias — the backend accesses `module.kernel`
+kernel = my_allreduce_kernel
+```
+
+- The `kernel` alias is the entry point the backend invokes. Whatever
+  the function name is (e.g., `allreduce_intercube_multidevice`), it
+  must be exposed via `module.kernel = fn`.
+- Without `kernel_args`, the backend has no way to build the
+  algorithm's argument list. See D2 for the signature.
+- If `TOPO_NAME_TO_KIND` is absent, the backend falls back to
+  `sip_topo_kind = 0`. An algorithm supporting only a single topology
+  may omit it.
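+
+The rules above can be sketched from the backend's side as a small
+contract check. `load_algorithm_contract` and the in-memory `demo`
+module are hypothetical names for illustration only; raising
+`KeyError` for an unknown topology name is this sketch's assumption,
+not documented backend behavior.
+
+```python
+import types
+
+
+def load_algorithm_contract(module, topo_name):
+    """Return (kernel, kernel_args, sip_topo_kind) for an algorithm module."""
+    for required in ("kernel", "kernel_args"):
+        if not hasattr(module, required):
+            raise AttributeError(
+                f"algorithm module lacks required symbol {required!r}")
+    mapping = getattr(module, "TOPO_NAME_TO_KIND", None)
+    # D1 fallback: a module without the mapping gets sip_topo_kind = 0.
+    kind = 0 if mapping is None else mapping[topo_name]
+    return module.kernel, module.kernel_args, kind
+
+
+# Stand-in module built in memory so the sketch runs without kernbench.
+demo = types.ModuleType("demo_allreduce")
+demo.kernel_args = lambda world_size, n_elem, *, cube_w=4, cube_h=4: (
+    n_elem, cube_w, cube_h, world_size)
+demo.kernel = lambda *args, tl: None
+
+kernel, build_args, kind = load_algorithm_contract(demo, "ring_1d")
+```
+
+With no `TOPO_NAME_TO_KIND` on the stand-in module, `kind` takes the
+fallback value `0`.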
+
+### D2. `kernel_args` signature — `(world_size, n_elem, *, cube_w, cube_h)`
+
+```python
+def kernel_args(world_size: int, n_elem: int, *,
+                cube_w: int = 4, cube_h: int = 4) -> tuple:
+    return (n_elem, cube_w, cube_h, world_size)
+```
+
+- **Positional arguments**: `world_size` (= number of ranks), `n_elem`
+  (= element count of a single shard, f16-based).
+- **Keyword arguments**: `cube_w`, `cube_h` (= cube-mesh dimensions).
+  Default 4×4 — aligned with `topology.yaml`'s `sip.cube_mesh` default.
+- **Return**: a tuple in the order the kernel's positional arguments
+  expect.
+
+When the backend calls `all_reduce`:
+
+```python
+kernel_args_tuple = self._algo_module.kernel_args(
+    self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
+)
+extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
+pending = self.ctx.launch(
+    self._merged["algorithm"], kernel_fn, tensor,
+    *kernel_args_tuple, *extra_args, _defer_wait=True,
+)
+```
+
+So the kernel's full positional argument list becomes: `(tensor_ptr,
+*kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w,
+sip_topo_h)`, with `tl=...` injected as a keyword. The tuple length
+and order returned by `kernel_args` must **match the kernel signature
+1:1**.
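+
+A minimal sketch of how the two tuples line up. `fake_launch` stands
+in for `RuntimeContext.launch`, and `recording_kernel` plus the sample
+values are assumptions made for illustration, not the backend's code.
+
+```python
+def kernel_args(world_size, n_elem, *, cube_w=4, cube_h=4):
+    # Order must match the kernel's positional parameters after t_ptr.
+    return (n_elem, cube_w, cube_h, world_size)
+
+
+def recording_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
+                     sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
+    # Return what each parameter received so the binding can be inspected.
+    return {
+        "t_ptr": t_ptr, "n_elem": n_elem, "cube_w": cube_w,
+        "cube_h": cube_h, "n_sips": n_sips, "sip_rank": sip_rank,
+        "sip_topo_kind": sip_topo_kind, "sip_topo_w": sip_topo_w,
+        "sip_topo_h": sip_topo_h, "tl": tl,
+    }
+
+
+def fake_launch(kernel_fn, tensor_ptr, *args, tl="tl-ctx"):
+    # Hypothetical stand-in: tensor pointer first, tl injected by keyword.
+    return kernel_fn(tensor_ptr, *args, tl=tl)
+
+
+args = kernel_args(4, 8, cube_w=2, cube_h=2)
+extra_args = (1, 0, 0, 0)  # sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h
+seen = fake_launch(recording_kernel, 0x1000, *args, *extra_args)
+```
+
+If `kernel_args` returned its values in a different order, the call
+would still succeed but the kernel would silently read `world_size`
+as, say, `cube_w`, which is why the 1:1 match matters.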
+
+### D3. Kernel signature — standardized 9 + tl arguments
+
+Recommended signature:
+
+```python
+def my_kernel(
+    t_ptr: int,             # VA base of the row-wise-sharded tensor on this SIP
+    n_elem: int,            # element count per cube tile (or per shard)
+    cube_w: int,            # cube mesh width  (from kernel_args)
+    cube_h: int,            # cube mesh height (from kernel_args)
+    n_sips: int,            # equal to world_size (rank = SIP, ADR-0024)
+    sip_rank: int,          # this SIP's rank
+    sip_topo_kind: int,     # result of TOPO_NAME_TO_KIND lookup
+    sip_topo_w: int,        # SIP mesh width (0 for ring_1d)
+    sip_topo_h: int,        # SIP mesh height (0 for ring_1d)
+    *, tl,                  # TLContext (auto-injected)
+) -> None:
+```
+
+Even if `kernel_args` chose a different positional argument order, the
+kernel's **last four positional arguments are always
+`(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`** — the backend
+appends them as `extra_args` (ADR-0047 D5). A custom algorithm must
+accept these four, but a single-SIP algorithm may simply ignore them.
+
+`tl` is injected via keyword — `RuntimeContext.launch` adds `tl=tl_ctx`
+just before invoking the kernel. The signature therefore exposes `tl`
+as keyword-only (`*, tl`) or as the trailing keyword parameter.
+
+### D4. Kernel body — freedom and constraints
+
+Available inside the kernel: every `tl.*` primitive from ADR-0046 D3.
+
+Common patterns:
+
+- `cube_id = tl.program_id(axis=1)` — this PE's cube index.
+- `pe_addr = t_ptr + cube_id * nbytes` — per-cube VA of the tile.
+- `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")` — load local
+  data.
+- `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)` — IPCQ
+  collective.
+- `acc = acc + recv` — TensorHandle arithmetic operators (ADR-0046 D4).
+- `tl.store(pe_addr, acc)` — store the result.
+
+The kernel body is plain Python — branching and loops are fine. But:
+
+- No SimPy `yield` or `async` (ADR-0046 D1).
+- No direct access to TensorHandle `.data` — the Phase 1 timing model
+  doesn't see data dependencies (ADR-0020's 2-pass separation).
+- Kernel execution must be deterministic: the same input must produce
+  the same op sequence. No randomness or external I/O.
+
+### D5. SIP topology semantics — meaning of `sip_topo_kind`
+
+The backend looks up `topology.yaml`'s `system.sips.topology` string
+in the algorithm module's `TOPO_NAME_TO_KIND` and passes the integer
+as `sip_topo_kind`. The algorithm then branches:
+
+```python
+if sip_topo_kind == SIP_TOPO_RING:
+    acc = _inter_sip_ring(...)
+elif sip_topo_kind == SIP_TOPO_TORUS:
+    acc = _inter_sip_torus_2d(...)
+elif sip_topo_kind == SIP_TOPO_MESH:
+    acc = _inter_sip_mesh_2d(...)
+```
+
+Each topology branch communicates with peers via IPCQ direction names
+(`"global_E"`, `"W"`, `"S"`, `"N"` …). Direction semantics are defined
+in ADR-0023/0025; `configure_sfr_intercube_multisip` installs the IPCQ
+neighbor table accordingly.
+
+If a topology kind not supported by the algorithm appears, prefer an
+explicit
+`raise ValueError(f"unsupported topology kind {sip_topo_kind}")`
+over a silent no-op: fail fast on misconfiguration.
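+
+The branch-plus-fail-fast pattern can be sketched as below. The
+function `select_inter_sip_phase` and its string return values are
+hypothetical; a real algorithm would call its own per-topology helper
+in each branch.
+
+```python
+SIP_TOPO_RING = 0
+SIP_TOPO_TORUS = 1
+SIP_TOPO_MESH = 2
+
+
+def select_inter_sip_phase(sip_topo_kind: int) -> str:
+    """Name the inter-SIP phase for a kind, or fail fast."""
+    if sip_topo_kind == SIP_TOPO_RING:
+        return "ring_1d"
+    if sip_topo_kind == SIP_TOPO_TORUS:
+        return "torus_2d"
+    if sip_topo_kind == SIP_TOPO_MESH:
+        return "mesh_2d_no_wrap"
+    # No silent fall-through: an unknown kind is a configuration error.
+    raise ValueError(f"unsupported topology kind {sip_topo_kind}")
+```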
+
+### D6. The `ccl.yaml` algorithm entry
+
+The algorithm module is paired with a `ccl.yaml` entry (ADR-0023 D10 +
+ADR-0047 D3):
+
+```yaml
+defaults:
+  algorithm: lrab_hierarchical_allreduce
+  n_elem: 8
+
+algorithms:
+  lrab_hierarchical_allreduce:
+    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
+    # optional: world_size override
+    # optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
+```
+
+- `module`: the full Python module path; `importlib.import_module`
+  consumes this string as-is.
+- `world_size` (optional): when set, overrides the topology fallback
+  (ADR-0047 D2).
+- Algorithm-specific parameters are consumed by
+  `configure_sfr_intercube_multisip`.
+
+Workflow to add a new algorithm:
+
+1. Write `src/kernbench/ccl/algorithms/<name>.py` following D1.
+2. Add the entry under `algorithms` in `ccl.yaml`.
+3. (If needed) extend `kernbench.ccl.sfr_config` with the SFR-install
+   branch.
+4. Add tests (e.g., `tests/sccl/test_<name>.py`, extending the
+   ADR-0043 eval harness).
+
+### D7. Legacy "rank = flat PE index" mode
+
+The `world_size` override in `ccl.yaml`, surfaced by ADR-0047 D2, is
+used by legacy "rank = flat PE index" tests. The algorithm module can
+assume `n_sips=world_size` ranks even in this mode — the backend
+maintains the rank↔(SIP, cube, PE) mapping, so no modal branching is
+needed inside the algorithm body.
+
+In single-cube workloads (where `cube_w=cube_h=1`), the algorithm must
+skip mesh-based phases — see the
+`single_cube = (cube_w == 1 and cube_h == 1)` pattern in
+`lrab_hierarchical_allreduce.py`.
+
+## Alternatives Considered
+
+### A1. Organize the algorithm module as a class (`class Allreduce: kernel(...)`)
+
+Rejected. The Python module namespace already identifies an algorithm
+(see ADR-0047 D3's `importlib.import_module`). A class wrapper adds
+indirection without simplifying dispatch. Module-level free functions
+plus a `kernel` alias are clean and obvious.
+
+### A2. Type `kernel_args` with an explicit dataclass
+
+Rejected (currently). Each algorithm normally has a different argument
+count; forcing one dataclass would hurt cross-algorithm interchange.
+The tuple return is simple and unpacks cleanly with the backend's
+`*kernel_args_tuple`. If an algorithm wants stronger internal typing,
+it may define its own NamedTuple.
+
+### A3. Move SFR installation inside the algorithm module
+
+Rejected. SFR installation
+(`configure_sfr_intercube_multisip`) is a cross-module decision
+combining topology + algorithm; `kernbench.ccl.sfr_config` is a more
+natural home than the algorithm module itself. D6's "extend
+sfr_config if needed" workflow keeps responsibility boundaries clear.
+
+### A4. Auto-register algorithm names via a decorator (analogous to ADR-0045's `@bench`)
+
+Rejected. Unlike benches, algorithms are already tied to `ccl.yaml`
+entries; an additional registry would be redundant. The string mapping
+in `module` is sufficient.
+
+## Consequences
+
+- ADR-0047 D3's one-line contract expands to a D1–D7 author-facing
+  guide; new algorithm signatures no longer need to be grep-derived.
+- D3's standardized 9 + tl signature couples naturally with the
+  backend's `extra_args` append (ADR-0047 D5). It is explicit that
+  even single-SIP-only algorithms must accept the four `sip_*` trailing
+  arguments.
+- D5's fail-loud recommendation means a `ccl.yaml` topology that the
+  algorithm doesn't support will surface as an explicit `ValueError`
+  rather than a silent wrong result.
+- D6's step-by-step addition workflow makes clear how far a new
+  algorithm has to reach into sfr_config / tests / ccl.yaml.
@@ -0,0 +1,288 @@
+# ADR-0051: Routing Helper API — `AddressResolver` + `PathRouter`
+
+## Status
+
+Accepted (2026-05-22).
+
+Pins down every public API, argument, return value, and adjacency-graph
+selection of the two helper classes (`AddressResolver`, `PathRouter`)
+exposed by `policy/routing/router.py`. ADR-0002 defines routing
+distance, ordering, and bypass rules, but **the helper API surface
+itself** has had no ADR-level coverage.
+
+## First action
+
+### `AddressResolver(graph)`
+
+On construction, caches two pieces of state:
+
+1. `self._node_ids = set(graph.nodes)` — a set of all node ids for
+   lookup.
+2. `self._hbm_slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube`
+   — derived from `graph.spec.cube.memory_map` (default `48 GB / 8
+   slices = 6 GB`). `resolve()` uses this value to decode `pe_id` from
+   an HBM PA's `hbm_offset`.
+
+In short, **AddressResolver's first act is "precompute the full set of
+node ids and the HBM slice size"**. It does not retain the graph
+itself.
+
+### `PathRouter(graph)`
+
+On construction, **builds four separate adjacency graphs in one pass**:
+
+1. `self._adj_all`: every edge (used for component-to-component
+   routing).
+2. `self._adj`: edges with `kind != "command"` (PE DMA / generic data
+   paths).
+3. `self._adj_mcpu_dma`: excludes
+   `_MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}` (M_CPU DMA
+   must not pass through PE pipeline nodes).
+4. `self._adj_local`: excludes the 8-element `_UCIE_KINDS` set (UCIe
+   would look like a zero-distance bus to Dijkstra, which would prefer
+   it over the mesh — for cube-local routing this must be avoided).
+
+Each graph is a `defaultdict(list)` of `(neighbor, weight)`. The
+weight is `edge.routing_weight_mm or edge.distance_mm`.
+
+In short, **PathRouter's first act is "classify topology edges into
+four policy-specific adjacency lists simultaneously"**. Each `find_*()`
+call picks the appropriate graph and runs Dijkstra.
+
+## Context
+
+`policy/routing/router.py` combines two responsibilities:
+
+- **Naming**: it is the sole owner of the topology naming convention
+  (`sip{S}.cube{C}.<comp>`, `sip{S}.io{I}.pcie_ep`, etc.). Components /
+  probe / IPCQ install / runtime API do not build node-id strings
+  themselves — they call helpers.
+- **Path decisions**: policy separation by `edge.kind`. For the same
+  src→dst, different routing intents (PE DMA vs M_CPU DMA vs general
+  component routing) call for different adjacencies and so produce
+  different paths.
+
+This helper API is widely consumed (probe.py / distributed.py /
+install.py / various components / tests), yet **the exact signatures /
+return semantics / adjacency picks** are not gathered in any ADR. This
+ADR closes that gap.
+
+## Decision
+
+### D1. `AddressResolver` exposes five public methods
+
+#### D1.1. `resolve(addr: PhysAddr) -> str`
+
+Translates a `PhysAddr` to a destination node id in the topology:
+
+```
+addr.kind == "hbm"             → f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
+  where pe_id = addr.hbm_offset // self._hbm_slice_bytes  (ADR-0017 D4/D9)
+
+addr.kind == "pe_resource":
+  addr.unit_type == PE         → f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
+  addr.unit_type == SRAM       → f"sip{s}.cube{d}.sram"
+  addr.unit_type == MCPU       → f"sip{s}.cube{d}.m_cpu"
+  others                       → RoutingError("unsupported unit_type")
+
+other kinds                    → RoutingError("unsupported address kind")
+```
+
+If the derived node id is not in `self._node_ids`, `resolve()` raises
+`RoutingError(f"node {node_id} not found in topology")`. So even a
+syntactically valid address fails loudly when its node is absent from
+the topology.
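+
+As a worked example of the HBM branch, the sketch below decodes
+`pe_id` with the default `48 GB / 8 slices` figures. The free
+functions `hbm_slice_bytes` and `resolve_hbm` are hypothetical
+stand-ins for the method's internals, and the local `RoutingError`
+class is defined only so the example runs on its own.
+
+```python
+GIB = 1 << 30
+
+
+class RoutingError(Exception):
+    """Local stand-in for the router's error type."""
+
+
+def hbm_slice_bytes(hbm_total_gb=48, slices_per_cube=8):
+    return hbm_total_gb * GIB // slices_per_cube
+
+
+def resolve_hbm(sip, cube, hbm_offset, node_ids, slice_bytes):
+    pe_id = hbm_offset // slice_bytes  # which PE slice owns this offset
+    node_id = f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"
+    if node_id not in node_ids:
+        raise RoutingError(f"node {node_id} not found in topology")
+    return node_id
+
+
+slice_bytes = hbm_slice_bytes()  # 6 GiB per slice
+nodes = {f"sip0.cube0.hbm_ctrl.pe{i}" for i in range(8)}
+target = resolve_hbm(0, 0, 13 * GIB, nodes, slice_bytes)
+```
+
+An offset of 13 GiB lands in slice `13 // 6 = 2`, so `target` is
+`"sip0.cube0.hbm_ctrl.pe2"`.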
+
+#### D1.2. `find_m_cpu(sip, cube) -> str`
+
+Returns `f"sip{sip}.cube{cube}.m_cpu"`; absent → `RoutingError`.
+
+#### D1.3. `find_pcie_ep(sip, io_id="io0") -> str`
+
+Returns `f"sip{sip}.{io_id}.pcie_ep"`; absent → `RoutingError`.
+
+#### D1.4. `find_io_cpu(sip, io_id="io0") -> str`
+
+Returns `f"sip{sip}.{io_id}.io_cpu"`; absent → `RoutingError`.
+
+#### D1.5. `find_all_pcie_eps() -> list[str]`
+
+All PCIE_EP node ids across all SIPs, sorted. Filtered by
+`endswith(".pcie_ep")`. Cross-SIP IPCQ uses this when enumerating
+PCIE_EPs.
+
+This class is the sole owner of the naming convention
+(`sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`) — ADR-0015 D4.
+The topology builder produces nodes with the same naming convention;
+components never build node-id strings directly — they go through
+these helpers.
+
+### D2. `PathRouter`'s four adjacency graphs
+
+Constructed in one pass. `edge.kind` drives policy:
+
+| graph             | excluded edge kinds                                                                                                                   | use case                                       |
+|-------------------|--------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
+| `_adj_all`        | (none)                                                                                                                               | generic node-to-node routing with command edges (M_CPU↔NOC) included; IO_CPU/M_CPU routes |
+| `_adj`            | `"command"`                                                                                                                          | PE DMA / generic data paths                     |
+| `_adj_mcpu_dma`   | `"pe_internal"`, `"pe_to_router"`                                                                                                    | M_CPU DMA (skips PE pipeline)                   |
+| `_adj_local`      | `_UCIE_KINDS` (`ucie_internal`, `ucie_conn_to_router`, `router_to_ucie_conn`, `ucie_conn_to_noc`, `noc_to_ucie_conn`, `ucie_mesh`, `io_to_cube`, `cube_to_io`) | same-cube routing (UCIe bus excluded) |
+
+Each graph is `dict[node_id, list[(neighbor, weight)]]` with weight =
+`edge.routing_weight_mm or edge.distance_mm`. Excluding command edges
+prevents them from influencing routing; isolating `_adj_local` keeps
+UCIe's "zero-distance bus" from out-competing the mesh — consistent
+with ADR-0017 D7's cross-PE-slice mesh-distance requirement.
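+
+The one-pass classification can be sketched as follows. The `Edge`
+dataclass, the `build_adjacency` helper, and the treatment of edges as
+directed `src -> dst` are assumptions for illustration; only the edge
+kinds and the weight rule come from the table above.
+
+```python
+from collections import defaultdict
+from dataclasses import dataclass
+from typing import Optional
+
+MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}
+UCIE_KINDS = {
+    "ucie_internal", "ucie_conn_to_router", "router_to_ucie_conn",
+    "ucie_conn_to_noc", "noc_to_ucie_conn", "ucie_mesh",
+    "io_to_cube", "cube_to_io",
+}
+
+
+@dataclass(frozen=True)
+class Edge:
+    src: str
+    dst: str
+    kind: str
+    distance_mm: float
+    routing_weight_mm: Optional[float] = None
+
+
+def build_adjacency(edges):
+    adj_all, adj, adj_mcpu_dma, adj_local = (
+        defaultdict(list) for _ in range(4))
+    for e in edges:
+        weight = e.routing_weight_mm or e.distance_mm
+        item = (e.dst, weight)
+        adj_all[e.src].append(item)           # every edge
+        if e.kind != "command":
+            adj[e.src].append(item)           # data paths only
+        if e.kind not in MCPU_DMA_EXCLUDE:
+            adj_mcpu_dma[e.src].append(item)  # skips the PE pipeline
+        if e.kind not in UCIE_KINDS:
+            adj_local[e.src].append(item)     # cube-local, no UCIe
+    return adj_all, adj, adj_mcpu_dma, adj_local
+
+
+edges = [
+    Edge("a", "b", "command", 1.0),
+    Edge("a", "c", "ucie_mesh", 5.0, routing_weight_mm=0.5),
+    Edge("a", "d", "pe_internal", 2.0),
+    Edge("a", "e", "mesh", 3.0),
+]
+adj_all, adj, adj_mcpu_dma, adj_local = build_adjacency(edges)
+```
+
+Selecting a graph at query time is then a plain attribute pick, which
+is the O(1) selection that alternative A1 contrasts with per-call
+filtering.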
+
+### D3. `PathRouter` exposes five public methods (+ two backward-compat shims)
+
+#### D3.1. `find_path(src_pe: str, dst_node: str) -> list[str]`
+
+**PE DMA routing**. `src_pe` is a PE prefix (e.g.,
+`"sip0.cube0.pe0"`); the function auto-prepends `.pe_dma`, making the
+true start node `"sip0.cube0.pe0.pe_dma"`.
+
+Adjacency depends on cube-locality (`_same_cube`):
+
+- **Same-cube** (src and dst share `sip{S}.cube{C}.` prefix): uses
+  `_adj_local`. Excluding UCIe lets cross-PE-slice access pay accurate
+  mesh distance (ADR-0017 D7).
+- **Cross-cube**: uses `_adj`. UCIe naturally becomes the right choice
+  for the cross-cube portion.
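+
+A minimal sketch of the cube-locality test and the resulting graph
+pick. `cube_prefix`, `same_cube`, and `pick_adjacency` are
+hypothetical names; the real `_same_cube` may be implemented
+differently.
+
+```python
+def cube_prefix(node_id):
+    """Return 'sip{S}.cube{C}.' for a cube-scoped node id, else None."""
+    parts = node_id.split(".")
+    if (len(parts) >= 2 and parts[0].startswith("sip")
+            and parts[1].startswith("cube")):
+        return ".".join(parts[:2]) + "."
+    return None
+
+
+def same_cube(src, dst):
+    prefix = cube_prefix(src)
+    return prefix is not None and prefix == cube_prefix(dst)
+
+
+def pick_adjacency(src_pe, dst_node):
+    # Same cube: mesh-only graph. Cross cube: data graph with UCIe.
+    return "_adj_local" if same_cube(src_pe, dst_node) else "_adj"
+```
+
+Comparing whole dotted components rather than raw string prefixes
+keeps `sip0.cube1` from matching `sip0.cube10`.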
+
+#### D3.2. `find_path_with_distance(src_pe, dst_node) -> tuple[list[str], float]`
+
+Same adjacency policy as D3.1, but returns `(path, total_distance)`.
+Used by probe and analysis tools that need the distance metric.
+
+#### D3.3. `find_mcpu_dma_path(m_cpu_id: str, dst_hbm_id: str) -> list[str]`
+
+**M_CPU DMA path**. Same cube → `_adj_local` (stay within the mesh);
+different cube → `_adj_all` (cross via UCIe). The
+`_MCPU_DMA_EXCLUDE` set ensures PE-pipeline nodes never appear on
+M_CPU's routes.
+
+#### D3.4. `find_memory_path(src: str, dst: str) -> list[str]`
+
+Direct memory path like
+`pcie_ep → io_noc → cube → router mesh → hbm_ctrl`. Uses
+`_adj_mcpu_dma` to exclude `pe_internal` and `pe_to_router`, so
+host-issued reads/writes never leak into the PE pipeline. Probe
+(ADR-0049 D1's H2D/D2H cases) calls this directly.
+
+#### D3.5. `find_node_path(src: str, dst: str) -> list[str]`
+
+Generic routing between arbitrary nodes, **including command edges**
+(via `_adj_all`). IoCpuComponent / MCpuComponent use this when they
+need to route through M_CPU ↔ NOC command-kind links.
+
+#### D3.6. Backward-compat shims
+
+- `_dijkstra(start, goal) -> list[str]` — thin wrapper for
+  `_run_dijkstra(self._adj, …)`.
+- `_dijkstra_with_dist(start, goal) -> tuple[list[str], float]` —
+  distance-aware variant.
+
+Despite the underscore prefixes (suggesting internal API), existing
+tests call these directly. New code should prefer D3.1–D3.5; these two
+shims are deprecation candidates.
+
+### D4. Dijkstra — single-source shortest path
+
+`_run_dijkstra_with_dist(adj, start, goal)`:
+
+- `heapq` priority queue.
+- `best: dict[node, distance]` — best known distance to each node.
+- `prev: dict[node, predecessor]` — for path reconstruction.
+- Edge weight = `routing_weight_mm or distance_mm`. The separation
+  matters because UCIe (and a few others) declare an explicit
+  `routing_weight_mm` distinct from physical `distance_mm`.
+
+`start == goal` short-circuits to `([start], 0.0)`. Unreachable target
+→ `RoutingError(f"no path from {start} to {goal}")`.
+
+The algorithm is **deterministic**: identical graph + start/goal gives
+the same path, satisfying SPEC R1 ("routing MUST be deterministic").
+Tie-breaks follow `heapq`'s push order (Python list order is
+deterministic).
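+
+The bullets above can be sketched as a small `heapq`-based routine.
+This is an illustration under stated assumptions, not the module's
+code: the function name drops the leading underscore, `RoutingError`
+is redefined locally, and the monotonically increasing counter used to
+break ties in push order is this sketch's choice.
+
+```python
+import heapq
+from itertools import count
+
+
+class RoutingError(Exception):
+    """Local stand-in for the router's error type."""
+
+
+def run_dijkstra_with_dist(adj, start, goal):
+    if start == goal:
+        return [start], 0.0
+    best = {start: 0.0}   # best known distance per node
+    prev = {}             # predecessor map for path reconstruction
+    tie = count()         # equal distances pop in push order
+    heap = [(0.0, next(tie), start)]
+    while heap:
+        dist, _, node = heapq.heappop(heap)
+        if node == goal:
+            break
+        if dist > best.get(node, float("inf")):
+            continue      # stale heap entry
+        for nbr, weight in adj.get(node, ()):
+            cand = dist + weight
+            if cand < best.get(nbr, float("inf")):
+                best[nbr] = cand
+                prev[nbr] = node
+                heapq.heappush(heap, (cand, next(tie), nbr))
+    if goal not in best:
+        raise RoutingError(f"no path from {start} to {goal}")
+    path = [goal]
+    while path[-1] != start:
+        path.append(prev[path[-1]])
+    path.reverse()
+    return path, best[goal]
+
+
+graph = {"a": [("b", 1.0), ("c", 5.0)], "b": [("c", 1.0)], "c": []}
+path, total = run_dijkstra_with_dist(graph, "a", "c")
+```
+
+Here the two-hop route through `b` (total 2.0) beats the direct edge
+(5.0), which is the kind of result a hop-count search would miss.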
+
+### D5. Single-owner principle for helper-API decisions
+
+The following decisions live only inside router.py:
+
+- Naming convention: `sip{S}.cube{C}.<comp>`,
+  `sip{S}.{io_id}.<comp>`,
+  `sip{S}.cube{C}.hbm_ctrl.pe{pe_id}`.
+- Adjacency policy: which edge kinds belong to which graph.
+- Algorithm for recovering the PE id from an HBM offset and the slice
+  size.
+- Dijkstra weight selection
+  (`routing_weight_mm or distance_mm`).
+
+Breaking single ownership (e.g., a component starting to build
+`f"sip{s}..."` itself) would explode the blast radius of naming-
+convention changes. This aligns with ADR-0015 D4.
+
+### D6. Consumers of the helper API
+
+Methods listed in this ADR are called from (current corpus):
+
+- `probes/probe.py` (ADR-0049): `find_pcie_ep`, `find_io_cpu`,
+  `find_m_cpu`, `find_node_path`, `find_mcpu_dma_path`,
+  `find_memory_path`, `find_path`, `resolve`.
+- `runtime_api/distributed.py` (ADR-0047): indirectly (engine-internal
+  routing).
+- `ccl/install.py` (ADR-0023): `find_all_pcie_eps`, `resolve`.
+- `sim_engine/event_log.py`: like probe — `find_pcie_ep`,
+  `find_memory_path`.
+- `components/builtin/m_cpu.py`, `components/builtin/io_cpu.py`:
+  `find_node_path`, `find_mcpu_dma_path`.
+- Tests (test_routing.py, test_cross_sip_routing.py, …): most of
+  D3.1–D3.5.
+
+When a new consumer arrives, D1/D3 act as a first-pass guide on
+whether an existing method matches the intent or a new one is needed.
+
+## Alternatives Considered
+
+### A1. One adjacency graph + per-call edge-kind filtering
+
+Rejected. Re-filtering the graph on every `find_*()` call hurts
+Dijkstra cache locality. Constructing four graphs in one pass (D2)
+has modest memory cost (edges ≤ a few × 10⁴), and selection happens
+in O(1) at call time.
+
+### A2. Drive adjacency separation by separate edge metadata rather than `kind`
+
+Rejected. `edge.kind` is already assigned by the topology builder
+(ADR-0015 D4 + ADR-0017); a parallel metadata field would force
+synchronization between two systems.
+
+### A3. Use BFS with uniform weights instead of Dijkstra
+
+Rejected. With per-edge `routing_weight_mm` (mesh link / UCIe /
+IO-internal), BFS minimizes hop count rather than total
+latency/distance. SPEC R1 + R2 require routing that is both
+deterministic and accurate; BFS is deterministic but not
+distance-accurate.
+
+### A4. Express the helper API as module functions instead of classes
+
+Rejected. Each class
+(`AddressResolver`, `PathRouter`) maintains caches
+(`_node_ids`, `_hbm_slice_bytes`, four adjacency graphs) reused across
+many routing queries on the same graph. Module functions would have
+to rebuild state per call or go global, hurting safety and
+performance.
+
+## Consequences
+
+- When components / probe / IPCQ install / runtime API all go through
+  router.py helpers, a naming-convention change (e.g., `.io0.` →
+  `.iochiplet0.`) is a one-file edit (D5).
+- D2's four-graph split is now ADR-locked, so when a new edge kind is
+  added (e.g., a new inter-die UCIe-link kind), the right adjacency
+  category is decided explicitly rather than by default.
+- D3.1's same-cube vs cross-cube branching (ADR-0017 D7) is explicit,
+  so anyone changing routing knows which adjacency to touch.
+- D6's consumer list bounds PR-review scope for helper-API changes,
+  and the backward-compat shims (D3.6) are flagged as deprecation
+  candidates.
@@ -0,0 +1,371 @@
+# ADR-0052: OpLog + MemoryStore Schemas — sim_engine internals
+
+## Status
+
+Accepted (2026-05-22).
+
+Pins down the `OpRecord` schema and the `record_start` / `record_end` /
+`record_copy` behavior in `sim_engine/op_log.py`, plus the
+(space, addr) namespace and read/write semantics of `MemoryStore` in
+`sim_engine/memory_store.py`. ADR-0020 (2-pass data execution) declares
+that these two facilities exist, but **the precise record fields and
+semantics** had no ADR-level coverage, and several recent ADRs
+(ADR-0046 D3.2's `tl.store` visibility, ADR-0023 D9's IPCQ copy
+record) depend on these semantics.
+
+## First action
+
+### `OpLogger(memory_store=None)`
+
+On construction, initializes three fields:
+
+1. `self._records: list[OpRecord] = []` — accumulated records.
+2. `self._pending: dict[int, dict] = {}` — partial records keyed by
+   `id(msg)` (created at `record_start`, completed at `record_end`).
+3. `self._memory_store = memory_store` — optional MemoryStore
+   reference. Used to capture math-op input snapshots and dma_write
+   HBM-source snapshots.
+
+Records and pending are empty; the `record_*` calls accumulate data
+over time.
+
+### `MemoryStore()`
+
+On construction, initializes a single field:
+`self._storage: dict[str, dict[int, np.ndarray]] = {}` — a two-level
+dict (`space → addr → ndarray`). Inner dicts are created lazily as new
+spaces appear.
+
+In short, **`OpLogger`'s first act is to set up empty record
+accumulators, and `MemoryStore`'s is to set up an empty, sparse
+per-space dict**. The first record / write fills them when it arrives.
+
+## Context
+
+ADR-0020 D2/D5/D7 (2-pass data execution) declares:
+
+- During Phase 1 (timing), `ComponentBase._on_process_start/end` hooks
+  call `OpLogger.record_start/end`, recording the time and metadata of
+  every data op.
+- Phase 2 (data) replays the op log in `t_start` order to compute real
+  data.
+- Data payloads live in `MemoryStore`, keyed by (space, addr).
+
+Subsequent ADRs (ADR-0023 D9's IPCQ atomic write, ADR-0027's Megatron
+TP scratch-overwrite avoidance, ADR-0046 D3.2's `tl.store` visibility)
+depend on op_log and MemoryStore behavior, but **the exact record
+fields / space names / snapshot timing** are only discoverable via
+source grep. This ADR codifies them.
+
+## Decision
+
+### D1. `OpRecord` schema — seven fields
+
+```python
+@dataclass
+class OpRecord:
+    t_start: float
+    t_end: float
+    component_id: str
+    op_kind: str               # "memory" | "gemm" | "math" | "unknown"
+    op_name: str               # e.g. "dma_read", "gemm_f16", "exp",
+                               #     "TileToken/DMA_READ", "composite_gemm",
+                               #     "ipcq_copy"
+    params: dict[str, Any]
+    dependency_ids: list[int] = field(default_factory=list)
+```
+
+- **`t_start` / `t_end`**: SimPy time (float ns). `t_start` is when the
+  component begins the op; `t_end` is completion. Duration =
+  `t_end - t_start`.
+- **`component_id`**: the node id where the op occurred (e.g.,
+  `"sip0.cube0.pe0.pe_dma"`).
+- **`op_kind`**: one of four. Phase 2 DataExecutor branches on this.
+- **`op_name`**: a debug/analysis-friendly name. For a TileToken,
+  expands to `"TileToken/{stage_type}"` (e.g.,
+  `"TileToken/DMA_READ"`) to disambiguate stages.
+- **`params`**: op-specific metadata dict (see D3).
+- **`dependency_ids`**: currently unused (default `[]`). Reserved for
+  future cross-op dependency tracking.
+
+### D2. `OpLogger.records` — guaranteed `t_start` sort
+
+```python
+@property
+def records(self) -> list[OpRecord]:
+    self._records.sort(key=lambda r: r.t_start)
+    return self._records
+```
+
+A stable sort by `t_start` runs on each access. Records with the same
+`t_start` preserve insertion order. Aligns with ADR-0020 D5's
+"t_start stable ordering" requirement.
+
+Phase 2 DataExecutor always accesses via the `records` property, so
+even when `record_end` calls arrive out of `t_start` order (e.g., a
+short op started later but finished earlier), the sequence handed to
+Phase 2 is consistent.
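+
+A small illustration of that ordering guarantee, using a hypothetical
+three-field `Rec` in place of `OpRecord`. Records are appended in
+completion order, as `record_end` would do, and then sorted the same
+way the property does.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class Rec:
+    t_start: float
+    t_end: float
+    name: str
+
+
+# Appended in the order the ops finished (t_end = 6, 9, 10).
+records = [
+    Rec(5.0, 6.0, "late_short"),
+    Rec(5.0, 9.0, "late_long"),
+    Rec(0.0, 10.0, "early_long"),
+]
+
+# list.sort is stable: equal t_start keeps insertion order.
+records.sort(key=lambda r: r.t_start)
+order = [r.name for r in records]
+```
+
+After the sort, `early_long` comes first, and the two records that
+share `t_start == 5.0` keep their original relative order.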
+
+### D3. `params` schema per `op_name` (matrix from `_extract_op_info`)
+
+#### D3.1. `op_kind="memory", op_name="dma_read"` (DmaReadCmd)
+
+```python
+{"src_addr": int, "nbytes": int, "handle_id": str}
+```
+
+#### D3.2. `op_kind="memory", op_name="dma_write"` (DmaWriteCmd)
+
+```python
+{
+    "src_space": str,   # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
+    "src_addr": int,    # handle.addr
+    "shape": tuple, "dtype": str,
+    "dst_space": "hbm", # DmaWrite always targets HBM
+    "dst_addr": int,
+    "nbytes": int,
+    "handle_id": str,
+    # When src_space == "hbm" at record_end, a snapshot is added (D4)
+    "snapshot": np.ndarray | None,
+}
+```
+
+#### D3.3. `op_kind="gemm", op_name=f"gemm_{dtype_a}"` (GemmCmd)
+
+```python
+{
+    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
+    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
+    "dtype_in": str, "dtype_out": str,
+    "m": int, "k": int, "n": int,
+    # ADR-0027: per-operand + output spaces preserved
+    "src_a_space": str, "src_b_space": str, "dst_space": str,
+}
+```
+
+#### D3.4. `op_kind="math", op_name=msg.op` (MathCmd; op = "exp", "sum", "add", "where", …)
+
+```python
+{
+    "input_addrs": list[int],   # addrs of input handles
+    "input_shapes": list[tuple],
+    "input_spaces": list[str],
+    "input_dtypes": list[str],
+    "dst_addr": int, "dst_space": str,
+    "shape_out": tuple, "dtype": str,
+    "axis": int | None,         # only meaningful for reductions
+    # All inputs get snapshots at record_end (D4)
+    "input_snapshots": list[np.ndarray | None],
+}
+```
+
+#### D3.5. `op_kind="gemm" or "math", op_name=f"composite_{op}"` (CompositeCmd)
+
+```python
+{
+    "op": str,              # "gemm" | "math"
+    "out_addr": int, "out_nbytes": int,
+    # If op == "gemm", same fields as GemmCmd are added:
+    "src_a_addr": int, "src_b_addr": int,
+    "shape_a": tuple, "shape_b": tuple,
+    "dtype_in": str, "dtype_out": str,
+    "src_a_space": str, "src_b_space": str,
+    "dst_space": "hbm", "dst_addr": int,  # = out_addr
+}
+```
+
+If `op == "gemm"`, `op_kind = "gemm"`; otherwise `"math"`. This
+aliasing lets Phase 2 replay composite-gemm on the same path as
+`GemmCmd`.
+
+#### D3.6. `op_kind="memory", op_name="ipcq_copy"` (record_copy path)
+
+```python
+{
+    "src_space": str, "src_addr": int,
+    "dst_space": str, "dst_addr": int,
+    "shape": tuple, "dtype": str, "nbytes": int,
+    "snapshot": np.ndarray | None,   # passed by caller; if None, record_copy reads fresh
+}
+```
+
+`PE_DMA._handle_ipcq_inbound` (ADR-0023 D9) emits this record so Phase
+2 can replay the IPCQ slot's inbound copy. It bypasses
+`record_start` / `record_end` and pushes directly via `record_copy()`.
+
+#### D3.7. `op_kind="unknown", op_name=type(msg).__name__`
+
+Fallback for messages that `_extract_op_info` doesn't recognize, with
+`params = {}`. If DataExecutor encounters this kind, it skips the
+record, so Phase 2 replay is unaffected.
+
+### D4. Snapshot capture timing
+
+When `OpLogger._memory_store` is set, `record_end` performs:
+
+- **Math op**: read every input
+  (addr/shape/space/dtype) from `self._memory_store.read(...)` and
+  attach an ndarray copy to `params["input_snapshots"]`. Read failure
+  → `None`.
+- **`dma_write` op**: snapshot the source **only if `src_space ==
+  "hbm"`** and attach to `params["snapshot"]`. TCM (PE scratch)
+  sources are **deliberately skipped** — TCM is repopulated by Phase 2
+  math/gemm replay, and a Phase-1-time snapshot would capture a
+  previous kernel's stale value (ADR-0027 postmortem: TP gemm →
+  all_reduce race).
+- **`ipcq_copy`**: the caller passes the in-flight snapshot via
+  `snapshot=token.data`. If absent, `record_copy` attempts a fresh
+  read from MemoryStore.
+
+Snapshots are taken with `.copy()` (fresh allocation), making them
+safe against later storage mutation. This is the foundation of
+ADR-0027's "cross-PE Phase 2 ordering" race-avoidance.
+
+When `memory_store` is `None` (Phase 1 timing-only mode), all
+snapshot steps are skipped. Only the timing portion of the record is
+preserved; data replay is unavailable.
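+
+The `dma_write` rule can be sketched as below. The helper
+`dma_write_snapshot` is hypothetical, and a plain dict keyed by
+`(space, addr)` stands in for `MemoryStore.read`; only the "HBM
+sources only, always `.copy()`" policy comes from this section.
+
+```python
+import numpy as np
+
+
+def dma_write_snapshot(params, store):
+    """Return a detached copy of an HBM source, else None."""
+    if store is None:
+        return None  # timing-only mode: no snapshots at all
+    if params.get("src_space", "tcm") != "hbm":
+        return None  # TCM sources are rebuilt by Phase 2 replay
+    arr = store.get((params["src_space"], params["src_addr"]))
+    return None if arr is None else arr.copy()
+
+
+store = {("hbm", 0x40): np.array([1.0, 2.0], dtype=np.float16)}
+snap = dma_write_snapshot({"src_space": "hbm", "src_addr": 0x40}, store)
+
+# A later in-place mutation of the store does not reach the snapshot.
+store[("hbm", 0x40)][0] = 9.0
+```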
+
+### D5. TileToken handling — `record_start` captures stage info
+
+ADR-0014 D6's self-routing tile token (pipeline mode) may have already
+advanced its `stage_idx` by the time `record_end` runs (the TileToken
+caches the next stage's params as it moves to the next component).
+Therefore:
+
+`record_start` pre-saves the following in `pending[id(msg)]["snap"]`:
+
+```python
+snap["stage_type"] = stage.stage_type.name        # "DMA_READ", "GEMM", ...
+snap["stage_params"] = dict(stage.params)         # copy of params at start time
+```
+
+`record_end` retrieves this snap and merges it into params:
+
+- Adds `params["stage_type"]` to final params.
+- Merges `stage_params` keys (keeps existing values if any).
+- If `op_name == "TileToken"`, rewrites it to
+  `f"TileToken/{stage_type}"` (e.g., `"TileToken/DMA_READ"`),
+  disambiguating different stages emitted by the same component.
+
+As a result, DMA_READ vs DMA_WRITE and FETCH vs STORE records emitted
+by the same component (e.g., pe_dma) are distinguishable in reports.
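+
+A minimal sketch of the merge step. The helper name
+`finalize_tile_token` is hypothetical, and reading "keeps existing
+values" as "values already in `params` win over `stage_params`" is an
+interpretation of the rule above, not confirmed behavior.
+
+```python
+def finalize_tile_token(op_name, params, snap):
+    """Merge the record_start snap into the final params."""
+    merged = dict(params)
+    merged["stage_type"] = snap["stage_type"]
+    for key, value in snap["stage_params"].items():
+        merged.setdefault(key, value)  # existing params keys are kept
+    if op_name == "TileToken":
+        op_name = f"TileToken/{snap['stage_type']}"
+    return op_name, merged
+
+
+snap = {"stage_type": "DMA_READ",
+        "stage_params": {"nbytes": 64, "src_addr": 0x100}}
+name, params = finalize_tile_token("TileToken", {"nbytes": 128}, snap)
+```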
+
+### D6. `MemoryStore` — two-level (space, addr) dict
+
+```python
+class MemoryStore:
+    def __init__(self) -> None:
+        self._storage: dict[str, dict[int, np.ndarray]] = {}
+
+    def write(self, space, addr, data):
+        # Create the inner dict lazily when a space is first written.
+        self._storage.setdefault(space, {})[addr] = data
+    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
+    def has(self, space, addr) -> bool: ...
+    def snapshot(self) -> MemoryStore: ...
+```
+
+#### D6.1. Space namespace
+
+A string key. Standard values:
+
+- `"hbm"`: HBM data (deploy_tensor + Phase 2 dma_write results).
+- `"tcm"`: PE-local TCM (Phase 2 math/gemm output).
+- `"sram"`: cube-level SRAM (ADR-0023 D9.7's IPCQ slot tier).
+
+Other spaces (e.g., `"reg"`) are allowed — `_storage` is a lazy dict
+that creates a new space when `write` first touches it.
+
+#### D6.2. Address keying
+
+`addr` is an integer. It may be a **physical address (PA) or a virtual
+address (VA)** — `MemoryStore` itself doesn't know address-space
+semantics; it just uses them as keys. Phase 1's `MemoryWriteMsg`
+writes both PA and VA
+(`_create_tensor` zero-inits at PA and at the VA base too); Phase 2
+reads/writes via the addresses captured by op_log.
+
+The caller decides `addr`'s meaning — `MemoryStore` provides only
+lookup.
+
+#### D6.3. read/write semantics — reference store (no copy)
+
+`write(space, addr, data)`: stores the ndarray reference. **No copy.**
+If the caller later mutates the same ndarray, the stored value
+changes.
+
+`read(space, addr, shape=None, dtype=None)`: returns the stored
+ndarray reference. If `shape`/`dtype` are provided:
+
+- `dtype != stored.dtype`: `arr.view(np_dtype)` reinterprets as a
+  view (no copy).
+- `shape != stored.shape`: if `nbytes` matches, `arr.reshape(shape)`
+  is a view.
+- `nbytes` mismatch → `ValueError`.
+
+To detach the data, the caller must call `arr.copy()`. ADR-0027's
+race-avoidance requires explicit `.copy()` in op_log snapshot steps
+for exactly this reason.
+
+#### D6.4. `has(space, addr) -> bool`
+
+Existence check; does not materialize data.
+
+#### D6.5. `snapshot() -> MemoryStore`
+
+Shallow copy. Creates new inner dicts but shares the ndarray
+references. Used at Phase 2 init to fork from Phase 1's store, so
+Phase 2 `write` calls (which rebind keys) don't affect Phase 1's
+remaining consumers. In-place mutation of a shared ndarray is still
+visible through both stores (D6.3).
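+
+A minimal sketch of the D6.3 and D6.5 semantics, assuming NumPy is
+available. `SketchStore` is a hypothetical stand-in written for this
+example, not the real `MemoryStore`; it omits `shape`/`dtype` handling
+and `has`.
+
+```python
+import numpy as np
+
+
+class SketchStore:
+    """Illustrative stand-in for MemoryStore (not the real class)."""
+
+    def __init__(self):
+        self._storage = {}
+
+    def write(self, space, addr, data):
+        # Lazily create the space, then store the reference (no copy).
+        self._storage.setdefault(space, {})[addr] = data
+
+    def read(self, space, addr):
+        return self._storage[space][addr]
+
+    def snapshot(self):
+        # Shallow fork: new inner dicts, shared ndarray references.
+        forked = SketchStore()
+        forked._storage = {s: dict(m) for s, m in self._storage.items()}
+        return forked
+
+
+phase1 = SketchStore()
+buf = np.zeros(4, dtype=np.float32)
+phase1.write("hbm", 0x1000, buf)
+buf[0] = 7.0  # caller mutation is visible through the store
+aliased = float(phase1.read("hbm", 0x1000)[0])
+
+phase2 = phase1.snapshot()
+phase2.write("hbm", 0x1000, np.ones(4, dtype=np.float32))  # rebinds the key
+isolated = float(phase1.read("hbm", 0x1000)[0])  # Phase 1 still sees buf
+
+phase2.write("tcm", 0x0, buf.copy())  # .copy() detaches the data
+buf[1] = 9.0
+detached = float(phase2.read("tcm", 0x0)[1])
+print(aliased, isolated, detached)  # 7.0 7.0 0.0
+```
+
+The last three lines show why an explicit `.copy()` is needed when a
+stored value must not follow later mutations of the caller's array.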
+
+### D7. op_log assumes a single-threaded SimPy
+
+`OpLogger`'s `_records` and `_pending` are accessed without locks.
+SimPy is single-threaded, so no other thread can intrude between
+`record_start` and `record_end` for the same message.
+
+When multi-process kernbench (ADR-0047 D6) arrives, OpLogger must be
+split per process — one OpLogger instance cannot receive records from
+multiple processes.
+
+## Alternatives Considered
+
+### A1. Externalize op_log to SQLite / parquet
+
+Rejected (currently). The in-memory list minimizes Phase 1 → Phase 2
+hand-off latency. Externalization makes sense for long-running batch
+runs but adds overhead for the current single-run workload.
+
+### A2. Capture snapshots at `record_start`
+
+Rejected. At `record_start`, inputs are often not yet populated (e.g.,
+a math op's input is the output of a just-issued previous op).
+`record_end` is the correct point.
+
+### A3. Per-component MemoryStore
+
+Rejected. The (space, addr) key already disambiguates effectively, and
+splitting per component would complicate cross-PE IPCQ copy (ADR-0023
+D9), which needs access to both source and destination stores.
+
+### A4. Explicit dependency edges in op_log
+
+Partially adopted. The `dependency_ids` field exists on `OpRecord` but
+is currently unused (D1). Phase 2 DataExecutor orders via `t_start` +
+a secondary sort (memory ops before math at the same `t_start`). When
+an explicit dependency graph is required, this field is the home.
+Current ordering rules are sufficient, so it remains unused.
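+
+The ordering rule described above (primary key `t_start`, memory ops
+before math on ties) can be sketched as a sort key. `Rec`,
+`MEMORY_KINDS`, and `replay_order` are hypothetical names for this
+example, and which op kinds count as "memory" is an assumption.
+
+```python
+from dataclasses import dataclass
+
+# Assumption: the set of memory op kinds is illustrative only.
+MEMORY_KINDS = {"dma_read", "dma_write"}
+
+
+@dataclass
+class Rec:
+    op_id: int
+    kind: str
+    t_start: float
+
+
+def replay_order(records):
+    # Primary key: t_start. Secondary key: memory ops (0) before math (1).
+    return sorted(
+        records,
+        key=lambda r: (r.t_start, 0 if r.kind in MEMORY_KINDS else 1),
+    )
+
+
+records = [
+    Rec(1, "gemm", 10.0),
+    Rec(2, "dma_read", 10.0),
+    Rec(3, "math", 5.0),
+]
+ordered_ids = [r.op_id for r in replay_order(records)]
+print(ordered_ids)  # [3, 2, 1]
+```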
+
+## Consequences
+
+- ADR-0020's op_log / MemoryStore declarations are expanded into the
+  concrete D1–D6 schemas, so writing/modifying Phase 2 DataExecutor
+  doesn't need source-grep to learn field semantics.
+- D3's per-`op_name` params matrix makes adding new ops (e.g., a new
+  reduction type) a matter of adding a branch in `_extract_op_info`.
+- D4's per-op snapshot policy (math = input snapshot, dma_write =
+  HBM-only snapshot) is ADR-locked, so ADR-0027's race-avoidance
+  decision won't silently regress on future refactors.
+- D6.3's reference-store semantics are explicit, putting mutation
+  safety on the caller. ADR-0027's explicit `.copy()` pattern is
+  justified.
+- D7's single-thread assumption is recorded, so multi-process
+  kernbench (ADR-0047 D6's supersession candidate) will need OpLogger
+  separation when introduced.
@@ -0,0 +1,351 @@
+# ADR-0053: Topology Builder + Visualizer Algorithms
+
+## Status
+
+Accepted (2026-05-22).
+
+Pins down the key algorithmic choices of the topology compile and
+visualization pipeline jointly implemented by `topology/builder.py`,
+`topology/mesh_gen.py`, and `topology/visualizer.py` —
+placement-driven router attachment, mesh auto-layout, the source_hash
+cache, view projections, and SVG rendering. ADR-0006 defines the
+high-level intent of topology compilation (compiled topology, distance
+extraction, automatic diagram generation), but **which algorithms the
+builder actually uses** was only discoverable via source grep.
+
+## First action
+
+When `resolve_topology(path_str)` is called, four steps run in order:
+
+1. **Path validation** (`builder.py::resolve_topology`):
+   `Path(path_str).expanduser().resolve()`, existence check, file
+   check. Failure → `FileNotFoundError` or `ValueError`.
+2. **YAML parsing** (`_read_spec`): `yaml.safe_load`. Parse errors
+   yield a `ValueError` with line/column. Non-dict roots are
+   rejected.
+3. **Auto-generate the mesh** (`mesh_gen.ensure_mesh_file`): create or
+   reuse a `cube_mesh.yaml` next to the topology file. Cache hit on
+   matching source_hash; miss triggers regeneration. This step decides
+   the cube NoC's router grid and attachment information.
+4. **Compile the graph** (`_compile_graph`): system → IO chiplets →
+   cubes → inter-cube edges → IO↔cube edges → system↔IO edges, then
+   build four view projections (system, sip, cube, pe) and wrap into
+   a `TopologyGraph`.
+
+In short, **topology compilation's first act is "read topology.yaml as
+a dict, create/validate cube_mesh.yaml in the same directory, then
+build the flat graph + 4-view projection in system → sip → cube → pe
+order"**.
+
+## Context
+
+`topology/` package responsibilities:
+
+- **builder.py** (1207 lines): turns topology.yaml into a
+  `TopologyGraph` (nodes + edges + 4 view projections).
+- **mesh_gen.py** (305 lines): auto-decides the cube NoC's router
+  grid and PE/UCIe/M_CPU/SRAM attachment positions and caches them in
+  `cube_mesh.yaml`.
+- **visualizer.py** (887 lines): generates four SVG diagrams (system /
+  sip / cube / pe) from a `TopologyGraph`.
+
+ADR-0006 makes the high-level decision that "the result of topology
+compilation is the single source for distance metadata and diagram
+generation", but specific algorithms (e.g., placement-driven nearest-
+router attachment, the HBM exclusion zone, which fields in source_hash
+trigger regeneration) are not in any ADR.
+
+In particular, these decisions are absent at ADR level:
+
+- Why is mesh_gen cached in a separate file (`cube_mesh.yaml`)?
+- Which fields are in source_hash, and which changes force
+  regeneration?
+- Why placement coordinates in mm rather than cube coordinates?
+- How are the HBM exclusion zone and UCIe N/S/E/W distribution
+  decided inside the mesh?
+- What is the abstraction-level difference among the four view
+  projections (system/sip/cube/pe)?
+
+This ADR captures these decisions in one place.
+
+## Decision
+
+### D1. Compile pipeline — six stages
+
+`_compile_graph(spec)`:
+
+1. **System nodes** (`_instantiate_system`): add system-level nodes
+   like `fabric.switch0` and the host CPU.
+2. **Per-SIP loop** (`for sip_id in range(system.sips.count)`):
+   - **IO chiplets** (`_instantiate_io_chiplets`): create pcie_ep /
+     io_cpu / io_noc / io_ucie PHYs / conn nodes and their bidirectional
+     internal edges.
+   - **Cube instantiation** (`_instantiate_cube`): using
+     cube_mesh.yaml's router grid, instantiate cube routers, PE
+     sub-components (pe_cpu, pe_dma, pe_fetch_store, pe_gemm, pe_math,
+     pe_mmu, pe_tcm, pe_scheduler, pe_ipcq), m_cpu, sram, hbm_ctrl,
+     and their internal edges.
+   - **Inter-cube edges** (`_add_inter_cube_edges`): the UCIe
+     N/S/E/W mesh edges.
+   - **IO ↔ cube edges** (`_add_io_to_cube_edges`): connect io_noc to
+     each cube's edge UCIe phy.
+3. **Switch ↔ IO edges** (`_add_system_to_io_edges`): bidirectional
+   edges between `fabric.switch0` and each SIP's `pcie_ep` (the
+   cross-SIP IPCQ path of ADR-0038 D3 + ADR-0010).
+4. **Build four view projections**:
+   - `_build_system_view(spec)` — Tray level: SIPs and the system
+     switch.
+   - `_build_sip_view(spec)` — inside one SIP: cube mesh + IO
+     chiplet.
+   - `_build_cube_view(spec)` — inside one cube: router grid + PE /
+     M_CPU / SRAM / HBM_CTRL attachments.
+   - `_build_pe_view(spec)` — inside one PE: nine sub-components +
+     internal edges (pe_internal kind).
+5. **Return `TopologyGraph`**: `TopologyGraph(spec, nodes, edges,
+   system_view, sip_view, cube_view, pe_view)`.
+
+The six construction stages (system nodes, IO chiplets, cubes,
+inter-cube edges, IO ↔ cube edges, switch ↔ IO edges) are **ordered
+for a reason**: only after cubes exist do inter-cube edges have valid
+src/dst, and IO chiplets must precede the IO ↔ cube edges that
+reference them. New node types must slot into the right spot.
+
+### D2. `cube_mesh.yaml` — a separate file with a source_hash cache
+
+`mesh_gen.ensure_mesh_file(cube_spec, mesh_path)`:
+
+1. Compute `source_hash = _compute_source_hash(cube_spec)` from these
+   input fields:
+   - `geometry` (cube_mm.w/h …).
+   - `pe_layout` (corners, pe_per_corner).
+   - `ucie.n_connections`.
+   - `memory_map.hbm_mapping_mode`.
+   - `placement` (m_cpu/sram pos_mm).
+2. If `mesh_path` (= `cube_mesh.yaml` next to topology.yaml) exists
+   and `existing.source_hash == source_hash`, reuse it (cache hit).
+3. Otherwise, generate a new mesh via
+   `_generate_mesh(cube_spec, source_hash)` and write to yaml.
+
+The mesh is cached as a separate file because:
+
+- Mesh generation involves nontrivial PE/UCIe/router attachment math
+  and is too expensive to redo every time.
+- Multiple runs with the same cube spec must guarantee an identical
+  mesh.
+- The resulting mesh is itself an inspectable / debuggable artifact.
+
+The five fields listed in source_hash are the ones that determine
+mesh shape; other changes (e.g., bandwidth, overhead_ns) do not
+trigger mesh regeneration.
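+
+The cache check can be sketched as follows. This is not the actual
+`_compute_source_hash`: the hashing scheme (canonical JSON, then
+SHA-256), the helper names, and the example spec values are all
+assumptions made for illustration. Only the list of five hashed fields
+comes from the text above.
+
+```python
+import hashlib
+import json
+
+# The five mesh-shaping inputs listed above, as key paths into cube_spec.
+HASHED_FIELDS = (
+    ("geometry",),
+    ("pe_layout",),
+    ("ucie", "n_connections"),
+    ("memory_map", "hbm_mapping_mode"),
+    ("placement",),
+)
+
+
+def _dig(spec, path):
+    for key in path:
+        spec = spec.get(key) if isinstance(spec, dict) else None
+    return spec
+
+
+def sketch_source_hash(cube_spec):
+    """Hypothetical hash: canonical JSON of the five fields, then SHA-256."""
+    picked = {".".join(p): _dig(cube_spec, p) for p in HASHED_FIELDS}
+    blob = json.dumps(picked, sort_keys=True).encode("utf-8")
+    return hashlib.sha256(blob).hexdigest()
+
+
+def needs_regeneration(cube_spec, existing_mesh):
+    # existing_mesh: parsed cube_mesh.yaml, or None if the file is absent.
+    if existing_mesh is None:
+        return True
+    return existing_mesh.get("source_hash") != sketch_source_hash(cube_spec)
+
+
+spec = {
+    "geometry": {"cube_mm": {"w": 12.0, "h": 12.0}},
+    "pe_layout": {"corners": ["NW", "NE", "SW", "SE"], "pe_per_corner": 2},
+    "ucie": {"n_connections": 2},
+    "memory_map": {"hbm_mapping_mode": "interleaved", "capacity_gb": 24},
+    "placement": {"m_cpu": {"pos_mm": [1.5, 5.5]}},
+}
+cached = {"source_hash": sketch_source_hash(spec)}
+
+spec["memory_map"]["capacity_gb"] = 48  # not hashed: cache stays valid
+after_capacity_change = needs_regeneration(spec, cached)
+spec["ucie"]["n_connections"] = 4  # hashed: cache is invalidated
+after_ucie_change = needs_regeneration(spec, cached)
+print(after_capacity_change, after_ucie_change)  # False True
+```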
+
+### D3. Cube NoC mesh auto-layout
+
+`_generate_mesh(cube_spec, source_hash)`:
+
+#### D3.1. Rows / columns
+
+- `pe_positions = _corner_pe_positions(cube_w, cube_h)`: PE-center
+  coordinates (mm) per corner (NW/NE/SW/SE). Hardcoded patterns like
+  `(1.5, 1.5)` and `(cube_w-1.5, cube_h-1.5)`; with `pe_per_corner=2`,
+  each corner has two PE positions.
+- `col_xs = _compute_col_positions(...)`: union of PE x-coordinates,
+  plus relay columns inserted when any gap exceeds
+  `max_spacing = 3.0 mm`.
+- `row_ys, rows_per_half = _compute_row_positions(cube_h,
+  n_connections, pe_positions)`:
+  - `n_conn = max(n_connections, 2)` (hot-path minimum).
+  - `rows_per_half = ceil(n_conn / 2)`.
+  - Top half + two HBM rows + bottom half. HBM sits at
+    `(cube_h/2 - 1.5, cube_h/2 + 1.5)`. The gap between PE rows and
+    HBM rows is `hbm_gap = 1.5 mm`.
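+
+A rough sketch of the column and row sizing rules above. The function
+names are hypothetical, and the even spacing of relay columns inside an
+oversized gap is an assumption; the real `_compute_col_positions` may
+place relays differently.
+
+```python
+import math
+
+
+def sketch_col_positions(pe_xs, max_spacing=3.0):
+    """Union of PE x-coordinates plus relay columns in oversized gaps."""
+    xs = sorted(set(pe_xs))
+    cols = [xs[0]]
+    for left, right in zip(xs, xs[1:]):
+        gap = right - left
+        if gap > max_spacing:
+            # Assumption: split the gap into equal segments <= max_spacing.
+            n_segments = math.ceil(gap / max_spacing)
+            step = gap / n_segments
+            cols.extend(left + step * i for i in range(1, n_segments))
+        cols.append(right)
+    return cols
+
+
+def sketch_rows_per_half(n_connections):
+    n_conn = max(n_connections, 2)  # minimum of two connections
+    return math.ceil(n_conn / 2)
+
+
+cols = sketch_col_positions([1.5, 4.5, 10.5, 13.5])
+halves = [sketch_rows_per_half(n) for n in (1, 2, 3, 4)]
+print(cols)    # [1.5, 4.5, 7.5, 10.5, 13.5]
+print(halves)  # [1, 1, 2, 2]
+```
+
+Here the 6 mm gap between 4.5 and 10.5 exceeds `max_spacing`, so one
+relay column is inserted at 7.5.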
+
+#### D3.2. HBM exclusion zone
+
+`hbm_row_start = rows_per_half`,
+`hbm_row_end = rows_per_half + 1`.
+`hbm_col_start = n_cols // 2 - 1`,
+`hbm_col_end = n_cols // 2`.
+
+Router slots inside this (row, col) rectangle are marked `None` (no
+router). HBM controllers are added separately as
+`hbm_ctrl.pe{X}` nodes following ADR-0017 D9's per-PE partition
+pattern.
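+
+The exclusion rectangle can be sketched as a grid with `None` slots.
+`sketch_router_grid` is a hypothetical helper, and treating the start
+and end indices as inclusive bounds is an assumption read off the
+formulas above.
+
+```python
+def sketch_router_grid(n_rows, n_cols, rows_per_half):
+    """Return router names per slot, with None inside the HBM rectangle."""
+    hbm_row_start, hbm_row_end = rows_per_half, rows_per_half + 1
+    hbm_col_start, hbm_col_end = n_cols // 2 - 1, n_cols // 2
+    grid = []
+    for r in range(n_rows):
+        row = []
+        for c in range(n_cols):
+            in_hbm = (
+                hbm_row_start <= r <= hbm_row_end
+                and hbm_col_start <= c <= hbm_col_end
+            )
+            row.append(None if in_hbm else f"r{r}c{c}")
+        grid.append(row)
+    return grid
+
+
+grid = sketch_router_grid(n_rows=4, n_cols=4, rows_per_half=1)
+for row in grid:
+    print(row)
+# ['r0c0', 'r0c1', 'r0c2', 'r0c3']
+# ['r1c0', None, None, 'r1c3']
+# ['r2c0', None, None, 'r2c3']
+# ['r3c0', 'r3c1', 'r3c2', 'r3c3']
+```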
+
+#### D3.3. PE attachment
+
+Each corner's PEs map to a row:
+
+- Top half: NW → row 0, NE → row 1 (top_corners index).
+- Bottom half: SW → row `hbm_row_end + 1`, SE → row
+  `hbm_row_end + 2`.
+
+Each PE's x-coordinate attaches to the nearest column's router
+(`min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))`).
+Attachment items are `pe{pe_idx}.dma`, `pe{pe_idx}.cpu`,
+`pe{pe_idx}.hbm` (pushed into the router's attach list).
+
+#### D3.4. M_CPU / SRAM attachment — nearest router by Euclidean distance
+
+For `placement.m_cpu.pos_mm` (default `[1.5, 5.5]`) and
+`placement.sram.pos_mm` (default `[1.5, 8.5]`), find the router with
+the smallest Euclidean distance and append `"m_cpu"` / `"sram"` to
+its attach list.
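+
+A minimal sketch of the nearest-router lookup, using the default
+`pos_mm` values above. `nearest_router`, the row/column coordinates,
+and the choice to skip excluded (`None`) slots are assumptions for this
+example.
+
+```python
+import math
+
+
+def nearest_router(grid_xy, pos_mm):
+    """grid_xy maps (row, col) to (x_mm, y_mm), or None for excluded slots."""
+    candidates = {rc: xy for rc, xy in grid_xy.items() if xy is not None}
+    return min(candidates, key=lambda rc: math.dist(candidates[rc], pos_mm))
+
+
+# Illustrative router coordinates in mm (not taken from a real mesh).
+col_xs = [1.5, 4.5, 7.5, 10.5]
+row_ys = [1.5, 4.5, 7.5, 10.5]
+grid_xy = {
+    (r, c): (x, y)
+    for r, y in enumerate(row_ys)
+    for c, x in enumerate(col_xs)
+}
+grid_xy[(1, 1)] = None  # pretend this slot lies in the HBM exclusion zone
+
+m_cpu_router = nearest_router(grid_xy, (1.5, 5.5))
+sram_router = nearest_router(grid_xy, (1.5, 8.5))
+print(m_cpu_router, sram_router)  # (1, 0) (2, 0)
+```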
+
+#### D3.5. UCIe N/S/E/W distribution
+
+`ucie_pe_rows = top_pe_rows + bot_pe_rows` (total
+`2 * rows_per_half`).
+
+- UCIe-E: for each PE row in turn, attach `ucie_e.c{i}` to the
+  rightmost column's router.
+- UCIe-W: attach `ucie_w.c{i}` to the leftmost column's router (E's
+  mirror).
+- UCIe-N/S: split PE columns into left and right halves; attach to
+  the top row's / bottom row's matching columns.
+
+Each UCIe connection is suffixed `c{i}`, distributing the
+`ucie.n_connections` PHYs (ADR-0017 D5+).
+
+### D4. Node naming convention — single ownership
+
+builder.py creates nodes with the following naming convention (the
+single-owner principle from ADR-0051 D5):
+
+- `fabric.switch0` — system-level switch.
+- `sip{S}.{io_id}.{pcie_ep|io_cpu|io_noc|io_ucie.{dir}|conn.{id}}` —
+  IO chiplet.
+- `sip{S}.cube{C}.{m_cpu|sram|hbm_ctrl.pe{X}|noc.r{R}c{C}|...}` —
+  inside cube.
+- `sip{S}.cube{C}.pe{P}.{pe_cpu|pe_dma|pe_fetch_store|pe_gemm|pe_math|pe_mmu|pe_tcm|pe_scheduler|pe_ipcq}` —
+  PE sub-components.
+
+Changing this convention requires updating both builder.py and
+router.py's helpers (ADR-0051). Components never know the convention
+directly — they only call the helpers.
+
+### D5. Edge `kind` classification
+
+Every edge gets a `kind`; routing policy (ADR-0051 D2) reads it. Major
+kinds:
+
+- `"pe_internal"` — within a PE between sub-components.
+- `"pe_to_router"` — PE_DMA ↔ cube NoC router.
+- `"router_mesh"` — between cube NoC routers.
+- `"router_to_hbm"`, `"router_to_mcpu"`, `"router_to_sram"`,
+  `"sram_to_router"`, etc. — between cube-attached components.
+- `"ucie_internal"`, `"ucie_conn_to_router"`,
+  `"router_to_ucie_conn"`, `"ucie_conn_to_noc"`,
+  `"noc_to_ucie_conn"`, `"ucie_mesh"` — UCIe-related.
+- `"io_internal"` — inside IO chiplet.
+- `"io_to_cube"`, `"cube_to_io"` — at the IO ↔ cube boundary.
+- `"pcie"` — switch ↔ pcie_ep.
+- `"command"` — control-plane edges only (e.g., M_CPU ↔ NOC; excluded
+  from PE DMA paths).
+
+Adding a new edge kind requires picking a category in router.py's
+four adjacency graphs (ADR-0051 D2). If you forget, it defaults to
+`_adj_all` only, which can produce unintended routes.
+
+### D6. View projection — four abstraction levels
+
+`TopologyGraph` keeps four view projections alongside the flat
+nodes+edges:
+
+- **system_view** (`_build_system_view`): Tray level. SIP blocks and
+  `fabric.switch0`. PCIe links shown. For external high-level
+  overview.
+- **sip_view** (`_build_sip_view`): inside one SIP — cube mesh + IO
+  chiplet (pcie_ep + io_cpu + io_noc). UCIe N/S/E/W appear as
+  cube-cube links.
+- **cube_view** (`_build_cube_view`): inside one cube — router grid +
+  PE / M_CPU / SRAM / HBM_CTRL attachments + UCIe PHY edges. For
+  intra-cube routing / placement debugging.
+- **pe_view** (`_build_pe_view`): inside one PE — nine sub-components
+  + internal edges (pe_internal kind). For detailed PE-internal
+  dataflow review.
+
+Views are selectively rendered via the spec's
+`visualization.emit_views: [system, sip, cube]` (ADR-0006). The pe
+view is omitted from default output but the code is retained for
+detailed debugging.
+
+### D7. visualizer.py — SVG diagram output
+
+`emit_diagrams(graph, out_dir)` renders every view as SVG. Key
+functions:
+
+- `_render_view_svg(view)` — generic view render (no router grid).
+- `_render_cube_view_svg(view, spec)` — cube-view specific (HBM block,
+  router grid layout, PE/M_CPU/SRAM/HBM placement).
+- `_draw_node`, `_draw_edge` — node/edge visual representation.
+- `_pick_scale`, `_compute_node_sizes` — auto-scaling.
+
+The visualizer output is a **derived artifact** (ADR-0006); changes
+here do not go through production checks. This aligns with CLAUDE.md's
+"Derived Artifacts" guidance.
+
+### D8. Blast radius of spec changes
+
+| spec field                            | effect              | mesh regenerated? |
+|---------------------------------------|---------------------|-------------------|
+| `system.sips.count`                   | SIP count, node count | No                |
+| `sip.cube_mesh.w/h`                   | cube mesh shape     | No                |
+| `cube.geometry.cube_mm.w/h`           | cube size (mm)      | **Yes**           |
+| `cube.pe_layout.corners/pe_per_corner`| PE attachment positions | **Yes**       |
+| `cube.ucie.n_connections`             | UCIe PHY distribution | **Yes**         |
+| `cube.memory_map.hbm_mapping_mode`    | HBM distribution mode | **Yes**         |
+| `cube.placement`                      | M_CPU/SRAM positions | **Yes**          |
+| `cube.memory_map.*` (besides above)   | HBM capacity / BW   | No                |
+| `*.links.*.bw_gbs`                    | edge bandwidth      | No                |
+| `*.attrs.overhead_ns`                 | component latency   | No                |
+
+The table mirrors D2's `_compute_source_hash` inputs. Changes that
+require mesh regeneration automatically invalidate `cube_mesh.yaml`'s
+source_hash.
+
+## Alternatives Considered
+
+### A1. Regenerate the mesh on every compile without a cache file
+
+Rejected. The cost of mesh generation would be paid repeatedly (CLI
+runs, probe, tests) for the same spec, and the human-inspectable
+artifact would disappear.
+
+### A2. Merge mesh generation into builder.py
+
+Rejected (currently). It is a 305-line algorithm of its own, and the
+mesh-layout decisions (placement-driven router attachment, HBM
+exclusion zone) are different from builder's general node/edge
+emission. Keeping it separate respects single-responsibility.
+
+### A3. Express placement coordinates in cube coordinates (col/row)
+
+Rejected. mm coordinates flow consistently between the visualizer and
+mesh layout (for nearest-router computation). Cube coordinates are
+undefined until the router grid is fixed, so they are unsuitable as
+placement input.
+
+### A4. Lazy view projection generation
+
+Rejected (currently). The four views are cheap to build (typically <
+100 ms), and eager construction guarantees `TopologyGraph` as the
+single source of truth.
+
+### A5. Visualizer output in formats besides SVG (PNG/PDF)
+
+Rejected. SVG is vector + text-searchable + directly renderable in
+browsers. PNG conversion, when required, is downstream
+post-processing (e.g., rsvg-convert).
+
+## Consequences
+
+- ADR-0006's high-level intent is fleshed out via D1–D7; topology
+  changes can be assessed quickly via D8's table.
+- D3's mesh-layout algorithm is ADR-locked, so future PE attachment
+  patterns (e.g., a 6-zone HBM split) make clear which stage they
+  affect.
+- D5's edge-kind list and D6's view structure are explicit, giving PR
+  reviewers a quick map of where (builder + router + visualizer) a
+  new component type ripples through.
+- D2's source_hash invalidation rules are explicit, so a stale
+  `cube_mesh.yaml` (e.g., when only bandwidth changed) is recognized
+  as correct behavior.
@@ -0,0 +1,143 @@
+# ADR-0054: Milestone Eval Benches — self-contained sweep + figure benches
+
+## Status
+
+Accepted (2026-05-22).
+
+Amends ADR-0044 (D1/D2) and ADR-0045 (D5) and supersedes the "logic lives
+in `scripts/` + `tests/`" arrangement of ADR-0043/0044: the GEMM and
+allreduce evaluation harnesses are now self-contained **benches** that a
+user runs to regenerate every result + figure.
+
+## Context
+
+ADR-0043 (allreduce eval) and ADR-0044 (GEMM eval) split each harness into
+a **sweep** (a manual `scripts/` driver, or — for allreduce — the
+parametrized tests themselves) plus **figure tests** that render committed
+data. The sweep/render logic therefore lived under `scripts/gemm_sweep.py`,
+`tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.
+
+A milestone requirement ("refactor allreduce + GEMM evaluation so a user
+can run *one bench* to generate all the results and plots") cannot be met
+by that layout: a bench is production code and **must not import from
+`tests/`** (ADR-0007 layer direction). The eval logic had to move into
+production, reachable from a bench.
+
+The chosen home is the bench module itself — not a separate
+`kernbench.eval` package. A bench file may contain arbitrary module-level
+code; collapsing the harness into the bench keeps one file per domain and
+avoids an extra package layer.
+
+## Decision
+
+### D1. Two milestone benches own the eval logic
+
+- `src/kernbench/benches/milestone_1h_gemm.py` — GEMM shape×variant sweep +
+  the three figure renderers (moved from `scripts/gemm_sweep.py` +
+  `tests/gemm/_gemm_plot_helpers.py`).
+- `src/kernbench/benches/milestone_1h_ccl.py` — the distributed allreduce
+  driver, latency + buffer-kind sweeps, topology diagram, FSIM comparison,
+  and the direct-launch parity reference (moved from
+  `tests/sccl/_allreduce_helpers.py`).
+
+Each file is the **single home** for its domain's eval logic.
+
+### D2. The "eval bench" pattern (extends ADR-0045 D5)
+
+ADR-0045 D5 fixed a bench to a single configuration (single-SIP, or the
+ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:
+
+- An **eval bench** may drive *many* configurations and render figures. It
+  builds its own per-config `GraphEngine` / `RuntimeContext` instances
+  (one per sweep point) rather than using the outer `run_bench` engine.
+- Because the outer ctx then has no submitted handles, the bench submits a
+  **sentinel tensor** (`torch.zeros((1, 1), …)`) at the end to satisfy
+  `run_bench`'s "must submit at least one request" contract (ADR-0045 D4),
+  so the CLI exits 0.
+
+### D3. Output location
+
+Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
+(per user request — artifacts beside the bench). The directory holds only
+generated PNG/CSV/JSON (never a `.py`/`__init__.py`), so the eager-import
+audit (ADR-0045 first action) ignores it — `pkgutil.iter_modules` does not
+yield non-package subdirectories. It is **committed** (like the
+`docs/diagrams/` artifacts) so the figures are viewable on the remote;
+rerunning the bench regenerates it in place.
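+
+The claim about `pkgutil.iter_modules` can be checked in isolation. The
+sketch below builds a throwaway directory that imitates the layout (one
+bench module plus an output directory with no `__init__.py`); the file
+names are illustrative and nothing here touches the real package.
+
+```python
+import pkgutil
+import tempfile
+from pathlib import Path
+
+with tempfile.TemporaryDirectory() as tmp:
+    benches = Path(tmp)
+    (benches / "milestone_1h_gemm.py").write_text("# bench module\n")
+    out_dir = benches / "1H_milestone_output" / "gemm"
+    out_dir.mkdir(parents=True)
+    (out_dir / "sweep.json").write_text("{}\n")
+
+    # Only importable modules and packages are yielded; the output
+    # directory has no __init__.py, so it is skipped.
+    found = sorted(m.name for m in pkgutil.iter_modules([str(benches)]))
+
+print(found)  # ['milestone_1h_gemm']
+```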
+
+### D4. GEMM heavy sweep — fresh by default, `MILESTONE_FAST` to reuse
+
+`milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one
+shape is 2048 tiles). `MILESTONE_FAST=1` reuses the committed
+`docs/diagrams/gemm_sweep.json` and only re-renders (seconds). This
+reverses ADR-0044 D1/D2's "heavy sweep stays a manual/`slow`-marked step":
+running the bench *is* the regeneration. The slow path is exercised by a
+`@pytest.mark.slow` bench test; the fast-path test runs in the default
+test suite.
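+
+The mode switch can be sketched as an environment check.
+`choose_gemm_mode` and the mode labels are hypothetical, and treating
+only the exact value `"1"` as enabling the fast path is an assumption.
+
+```python
+import os
+
+
+def choose_gemm_mode(environ=None):
+    """Hypothetical helper: pick the sweep mode from MILESTONE_FAST."""
+    env = os.environ if environ is None else environ
+    if env.get("MILESTONE_FAST") == "1":
+        return "reuse-committed-json"  # re-render figures only
+    return "full-sweep"  # run every simulation, then render
+
+
+default_mode = choose_gemm_mode({})
+fast_mode = choose_gemm_mode({"MILESTONE_FAST": "1"})
+print(default_mode, fast_mode)  # full-sweep reuse-committed-json
+```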
+
+### D5. Tests + script reuse via thin re-export shims (single home kept)
+
+The pre-existing figure tests and the `scripts/gemm_sweep.py` entry point
+are retained and now reuse the bench modules:
+
+- `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
+  `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
+  `kernbench.benches.milestone_1h_gemm`.
+- `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
+  writers, sweep constants, renderers, and disk aggregators from
+  `kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
+  local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
+  `_bk_params`) and the fixture-coupled `_run_distributed`
+  (`monkeypatch.chdir` + `_drive_distributed`) wrapper.
+- `scripts/gemm_sweep.py` → thin wrapper over the bench's `run_sweep`.
+
+Tests importing a bench module is permitted (tests sit above production,
+ADR-0007); it triggers the whole-package eager audit, which already runs on
+every `kernbench` invocation. matplotlib stays lazily imported inside the
+renderers, so the audit's startup cost is unchanged.
+
+### D6. Flat module naming (no `benches/` subfolder)
+
+A `benches/` subpackage named `1H_milestone…` is impossible — a Python
+package name cannot start with a digit. The benches are therefore flat
+modules `milestone_1h_gemm.py` / `milestone_1h_ccl.py` with bench names
+`milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, letter-first per
+ADR-0045 D1).
+
+## Consequences
+
+### Positive
+
+- `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of
+  a domain's results + figures in one command — the milestone requirement.
+- Single source for the eval logic (the bench), reused by tests and the
+  script via shims; no duplication.
+- The figure tests and `scripts/gemm_sweep.py` keep working unchanged.
+
+### Negative / limitations
+
+- The two bench files are large (the CCL one mixes the distributed driver,
+  sweeps, and matplotlib drawing). A "bench" that is mostly an eval harness
+  is unusual; this ADR legitimizes it.
+- Generated artifacts live inside the source tree (`src/kernbench/benches/`)
+  by explicit request and are committed (so the figures are viewable on the
+  remote); rerunning the bench regenerates them.
+- `milestone-1h-ccl` (and the default `milestone-1h-gemm`) take minutes —
+  acceptable for an on-demand milestone artifact, not for routine runs.
+
+## Dependencies
+
+- **ADR-0007**: layer direction (why tests may import production but a bench
+  may not import tests).
+- **ADR-0043 / ADR-0044**: the allreduce / GEMM eval harnesses this ADR
+  relocates into benches.
+- **ADR-0045**: bench module contract; D2 here extends its D5 (single-device
+  rule) with the eval-bench pattern, and relies on D4 (NO_REQUESTS) for the
+  sentinel.
+- **ADR-0024**: rank = SIP launcher driven by the allreduce sweeps.
+
+## Open questions
+
+- Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from
+  ADR-0033/0014 rather than copied? Unchanged by this ADR.
+- Should `build_overview_slides.py` consume the milestone output PNGs
+  instead of drawing GEMM bars natively? Still open (ADR-0044 D6 / Negative).
@@ -0,0 +1,175 @@
+# ADR Index
+
+Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **47**.
+
+Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.
+
+## Design Principles
+
+- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — Verification Strategy and Phase 1 Test Plan
+- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — Latency Model: Assumptions and Known Simplifications
+
+## High-level Architecture
+
+- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — Target System Hierarchy & Modeling Scope  _(System hierarchy (Tray / SIP / CUBE / PE))_
+- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — Runtime API and Simulation Engine Boundaries  _(Runtime API ↔ sim_engine boundaries)_
+- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NOC and Memory Data Path  _(IOChiplet NOC and memory data path)_
+- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — Cube NOC and HBM Connectivity  _(Cube NOC and HBM connectivity)_
+
+## Detailed Architecture
+
+One subsection per component file under `src/kernbench/components/builtin/`.
+
+### forwarding
+
+- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding Component (forwarding_v1)
+
+### hbm_ctrl
+
+- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM Controller Internal Design
+
+### io_cpu
+
+- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU Component Model
+
+### m_cpu
+
+- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU and M_CPU.DMA Component Model
+
+### pcie_ep
+
+- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model
+
+### pe_cpu
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+
+### pe_dma
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
+
+### pe_fetch_store
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+
+### pe_gemm
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+
+### pe_ipcq
+
+- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
+
+### pe_math
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+
+### pe_mmu
+
+- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — Component + Utility Dual Role
+
+### pe_scheduler
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+
+### pe_tcm
+
+- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — Dual-Channel BW Serialization
+
+### sram
+
+- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC
+
+### tiling
+
+- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math Pipeline Plan Builders
+
+## Implementation Decisions
+
+### Address Scheme
+
+- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51-bit Physical Address Layout & Decoding Contract
+- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — Memory Addressing — PA / VA / LA Address Models
+
+### Routing & Helper API
+
+- [ADR-0002](./ADR-0002-lat-routing-distance.md) — Routing Distance, Ordering & Bypass Rules
+- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`
+
+### Memory Semantics & Local-HBM Bandwidth
+
+- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — Memory Semantics & Local-HBM Bandwidth Guarantee
+
+### Topology Compilation, Diagrams & Builder Algorithms
+
+- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — Diagram Views & Distance-Aware Layout Rules
+- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — Topology Compilation, Distance Extraction, and Automatic Diagram Generation
+- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms
+
+### Tensor Deployment and Allocation
+
+- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — Tensor Deployment and Allocation (Host Allocator, PA-first)
+
+### Kernel Execution and Host-Device Messaging
+
+- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — Kernel Execution Messaging and Completion Semantics
+- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
+
+### CLI Surface and Semantics
+
+- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — Command Line Interface and Execution Semantics
+
+### Component Port/Wire Fabric Model
+
+- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — Component Port/Wire Model and Fabric Routing
+
+### Two-Pass Data Execution
+
+- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass Data Execution Model (Timing / Data Separation)
+
+### 2D Grid Program Identity
+
+- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D Grid program_id Semantics
+
+### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
+
+- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
+- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — remove sip/num_sips fields
+- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
+- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
+- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`
+
+### IPCQ Direction Addressing
+
+- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching
+
+### Intercube All-Reduce
+
+- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
+
+### Evaluation Harnesses
+
+- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce Evaluation Harness — `tests/sccl/`
+- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`
+- [ADR-0054](./ADR-0054-eval-milestone-benches.md) — Milestone Eval Benches — self-contained sweep + figure benches
+
+### Bench Module Contract
+
+- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring
+
+### Kernel-side tl.* API (TLContext)
+
+- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract
+
+### Memory Allocator Algorithms
+
+- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
+
+### Probe Subcommand
+
+- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
+
+### Sim-engine Op Log and Memory Store Schemas
+
+- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -0,0 +1,13 @@
+buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
+hbm,torus_2d,6,128,256,2120.040000000012
+hbm,torus_2d,6,1024,2048,2717.2783333333473
+hbm,torus_2d,6,8192,16384,7315.184999999989
+hbm,torus_2d,6,32768,65536,23081.26500000037
+sram,torus_2d,6,128,256,2060.040000000012
+sram,torus_2d,6,1024,2048,2909.2783333333473
+sram,torus_2d,6,8192,16384,9523.184999999869
+sram,torus_2d,6,32768,65536,32201.265000000385
+tcm,torus_2d,6,128,256,1964.040000000012
+tcm,torus_2d,6,1024,2048,2477.2783333333473
+tcm,torus_2d,6,8192,16384,6403.185000000109
+tcm,torus_2d,6,32768,65536,19865.265000000378
@@ -1,13 +0,0 @@
-buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
-hbm,torus_2d,6,128,256,2120.0399999999754
-hbm,torus_2d,6,1024,2048,2716.74499999995
-hbm,torus_2d,6,8192,16384,7315.185000000081
-hbm,torus_2d,6,32768,65536,23081.265000008738
-sram,torus_2d,6,128,256,2060.0399999999754
-sram,torus_2d,6,1024,2048,2908.74499999995
-sram,torus_2d,6,8192,16384,9523.185000000081
-sram,torus_2d,6,32768,65536,32201.265000008752
-tcm,torus_2d,6,128,256,1964.0399999999754
-tcm,torus_2d,6,1024,2048,2476.74499999995
-tcm,torus_2d,6,8192,16384,6403.185000000081
-tcm,torus_2d,6,32768,65536,19865.265000008738
@@ -1,37 +1,37 @@
 algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
-intercube_allreduce,mesh_2d_no_wrap,6,8,16,256,2666.5524999999725
-intercube_allreduce,mesh_2d_no_wrap,6,32,64,1024,2747.7399999999725
-intercube_allreduce,mesh_2d_no_wrap,6,64,128,2048,2855.98999999998
-intercube_allreduce,mesh_2d_no_wrap,6,128,256,4096,3072.4899999999725
-intercube_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3336.579999999951
-intercube_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3707.49999999992
-intercube_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4449.339999999875
-intercube_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,5933.020000000055
-intercube_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,8900.380000000157
-intercube_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,14835.099999997583
-intercube_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,26704.540000017492
-intercube_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,38573.980000026335
-intercube_allreduce,ring_1d,6,8,16,256,2365.2558333333036
-intercube_allreduce,ring_1d,6,32,64,1024,2436.9433333333036
-intercube_allreduce,ring_1d,6,64,128,2048,2532.526666666643
-intercube_allreduce,ring_1d,6,128,256,4096,2723.6933333333036
-intercube_allreduce,ring_1d,6,512,1024,16384,3042.0349999999544
-intercube_allreduce,ring_1d,6,1024,2048,32768,3390.201666666597
-intercube_allreduce,ring_1d,6,2048,4096,65536,4079.7349999998714
-intercube_allreduce,ring_1d,6,4096,8192,131072,5458.801666666721
-intercube_allreduce,ring_1d,6,8192,16384,262144,8216.93500000014
-intercube_allreduce,ring_1d,6,16384,32768,524288,13733.201666664638
-intercube_allreduce,ring_1d,6,32768,65536,1048576,24765.735000014545
-intercube_allreduce,ring_1d,6,49152,98304,1572864,35798.268333355256
-intercube_allreduce,torus_2d,6,8,16,256,1700.6024999999754
-intercube_allreduce,torus_2d,6,32,64,1024,1753.2899999999754
-intercube_allreduce,torus_2d,6,64,128,2048,1823.539999999979
-intercube_allreduce,torus_2d,6,128,256,4096,1964.0399999999754
-intercube_allreduce,torus_2d,6,512,1024,16384,2196.2849999999653
-intercube_allreduce,torus_2d,6,1024,2048,32768,2476.74499999995
-intercube_allreduce,torus_2d,6,2048,4096,65536,3037.664999999919
-intercube_allreduce,torus_2d,6,4096,8192,131072,4159.50500000003
-intercube_allreduce,torus_2d,6,8192,16384,262144,6403.185000000081
-intercube_allreduce,torus_2d,6,16384,32768,524288,10890.544999998769
-intercube_allreduce,torus_2d,6,32768,65536,1048576,19865.265000008738
-intercube_allreduce,torus_2d,6,49152,98304,1572864,28839.985000013185
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,8,16,256,2666.552500000015
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,32,64,1024,2747.7400000000152
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,64,128,2048,2855.990000000018
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,128,256,4096,3072.490000000019
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3337.1133333333582
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3708.0333333333692
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4449.873333333393
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,5933.020000000124
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,8900.379999999863
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,14835.099999999224
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,26704.540000000765
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,38573.97999999701
+lrab_hierarchical_allreduce,ring_1d,6,8,16,256,2365.255833333347
+lrab_hierarchical_allreduce,ring_1d,6,32,64,1024,2436.9433333333473
+lrab_hierarchical_allreduce,ring_1d,6,64,128,2048,2532.526666666683
+lrab_hierarchical_allreduce,ring_1d,6,128,256,4096,2723.693333333349
+lrab_hierarchical_allreduce,ring_1d,6,512,1024,16384,3048.635000000021
+lrab_hierarchical_allreduce,ring_1d,6,1024,2048,32768,3393.4016666666957
+lrab_hierarchical_allreduce,ring_1d,6,2048,4096,65536,4082.401666666714
+lrab_hierarchical_allreduce,ring_1d,6,4096,8192,131072,5458.80166666677
+lrab_hierarchical_allreduce,ring_1d,6,8192,16384,262144,8216.934999999943
+lrab_hierarchical_allreduce,ring_1d,6,16384,32768,524288,13733.201666665835
+lrab_hierarchical_allreduce,ring_1d,6,32768,65536,1048576,24765.73500000064
+lrab_hierarchical_allreduce,ring_1d,6,49152,98304,1572864,35798.268333331536
+lrab_hierarchical_allreduce,torus_2d,6,8,16,256,1700.6025000000095
+lrab_hierarchical_allreduce,torus_2d,6,32,64,1024,1753.2900000000102
+lrab_hierarchical_allreduce,torus_2d,6,64,128,2048,1823.540000000012
+lrab_hierarchical_allreduce,torus_2d,6,128,256,4096,1964.040000000012
+lrab_hierarchical_allreduce,torus_2d,6,512,1024,16384,2196.8183333333463
+lrab_hierarchical_allreduce,torus_2d,6,1024,2048,32768,2477.2783333333473
+lrab_hierarchical_allreduce,torus_2d,6,2048,4096,65536,3038.1983333333583
+lrab_hierarchical_allreduce,torus_2d,6,4096,8192,131072,4159.5050000000665
+lrab_hierarchical_allreduce,torus_2d,6,8192,16384,262144,6403.185000000109
+lrab_hierarchical_allreduce,torus_2d,6,16384,32768,524288,10890.5449999995
+lrab_hierarchical_allreduce,torus_2d,6,32768,65536,1048576,19865.265000000378
+lrab_hierarchical_allreduce,torus_2d,6,49152,98304,1572864,28839.98500000059
@@ -1,81 +1,81 @@
 hop,label,size_bytes,path,total_ns
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,ipcq,24.88749999999891
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),128,raw,33.57999999999811
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,ipcq,28.13749999999891
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),256,raw,36.07999999999811
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,ipcq,29.88749999999891
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),384,raw,37.07999999999811
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,ipcq,31.63749999999891
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),512,raw,38.07999999999811
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,ipcq,35.13749999999891
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),768,raw,40.07999999999811
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,ipcq,38.63749999999891
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),1024,raw,42.07999999999811
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,ipcq,52.63749999999891
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),2048,raw,50.07999999999811
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,ipcq,80.63750000000073
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),4096,raw,66.08000000000175
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,ipcq,136.63750000000073
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),8192,raw,98.08000000000175
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,ipcq,164.63750000000073
-h1_intra_horizontal,Intra-cube horizontal (pe0 to pe1),10240,raw,114.08000000000175
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,ipcq,38.49749999999585
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),128,raw,47.18999999999505
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,ipcq,43.24749999999585
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),256,raw,51.18999999999505
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,ipcq,44.99749999999585
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),384,raw,52.18999999999505
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,ipcq,46.74749999999585
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),512,raw,53.18999999999505
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,ipcq,50.24749999999585
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),768,raw,55.18999999999505
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,ipcq,53.74749999999585
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),1024,raw,57.18999999999505
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,ipcq,67.74749999999585
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),2048,raw,65.18999999999505
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,ipcq,95.74750000000131
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),4096,raw,81.19000000000233
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,ipcq,151.7475000000013
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),8192,raw,113.19000000000233
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,ipcq,179.7475000000013
-h2_intra_vertical,Intra-cube vertical (pe0 to pe4),10240,raw,129.19000000000233
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,ipcq,81.15999999999804
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),128,raw,89.28999999999724
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,ipcq,88.65999999999804
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),256,raw,95.53999999999724
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,ipcq,90.90999999999804
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),384,raw,96.53999999999724
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,ipcq,93.15999999999804
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),512,raw,97.53999999999724
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,ipcq,97.65999999999804
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),768,raw,99.53999999999724
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,ipcq,103.15999999999804
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),1024,raw,102.53999999999724
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,ipcq,125.15999999999804
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),2048,raw,114.53999999999724
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,ipcq,169.15999999999985
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),4096,raw,138.54000000000087
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,ipcq,257.15999999999985
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),8192,raw,186.54000000000087
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,ipcq,301.15999999999985
-h3_inter_cube_horizontal,Inter-cube horizontal (cube0 to cube1),10240,raw,210.54000000000087
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,ipcq,103.15999999999804
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),128,raw,111.28999999999724
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,ipcq,112.65999999999804
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),256,raw,119.53999999999724
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,ipcq,114.90999999999804
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),384,raw,120.53999999999724
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,ipcq,117.15999999999804
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),512,raw,121.53999999999724
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,ipcq,121.65999999999804
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),768,raw,123.53999999999724
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,ipcq,127.15999999999804
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),1024,raw,126.53999999999724
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,ipcq,149.15999999999804
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),2048,raw,138.53999999999724
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,ipcq,193.15999999999985
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),4096,raw,162.54000000000087
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,ipcq,281.15999999999985
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),8192,raw,210.54000000000087
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,ipcq,325.15999999999985
-h4_inter_cube_vertical,Inter-cube vertical (cube0 to cube4),10240,raw,234.54000000000087
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),128,ipcq,24.88749999999891
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),128,raw,33.57999999999811
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),256,ipcq,28.13749999999891
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),256,raw,36.07999999999811
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),384,ipcq,29.88749999999891
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),384,raw,37.07999999999811
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),512,ipcq,31.63749999999891
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),512,raw,38.07999999999811
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),768,ipcq,35.13749999999891
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),768,raw,40.07999999999811
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),1024,ipcq,38.63749999999891
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),1024,raw,42.07999999999811
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),2048,ipcq,52.63749999999891
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),2048,raw,50.07999999999811
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),4096,ipcq,80.63750000000073
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),4096,raw,66.08000000000175
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),8192,ipcq,136.63750000000073
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),8192,raw,98.08000000000175
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),10240,ipcq,164.63750000000073
+latency_intracube_PE0_to_PE1_horizontal,Intra-cube PE-to-PE latency: PE0 → PE1 (horizontal),10240,raw,114.08000000000175
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),128,ipcq,38.49749999999585
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),128,raw,47.18999999999505
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),256,ipcq,43.24749999999585
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),256,raw,51.18999999999505
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),384,ipcq,44.99749999999585
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),384,raw,52.18999999999505
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),512,ipcq,46.74749999999585
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),512,raw,53.18999999999505
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),768,ipcq,50.24749999999585
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),768,raw,55.18999999999505
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),1024,ipcq,53.74749999999585
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),1024,raw,57.18999999999505
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),2048,ipcq,67.74749999999585
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),2048,raw,65.18999999999505
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),4096,ipcq,95.74750000000131
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),4096,raw,81.19000000000233
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),8192,ipcq,151.7475000000013
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),8192,raw,113.19000000000233
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),10240,ipcq,179.7475000000013
+latency_intracube_PE0_to_PE4_vertical,Intra-cube PE-to-PE latency: PE0 → PE4 (vertical),10240,raw,129.19000000000233
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),128,ipcq,81.15999999999804
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),128,raw,89.28999999999724
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),256,ipcq,88.65999999999804
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),256,raw,95.53999999999724
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),384,ipcq,90.90999999999804
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),384,raw,96.53999999999724
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),512,ipcq,93.15999999999804
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),512,raw,97.53999999999724
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),768,ipcq,97.65999999999804
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),768,raw,99.53999999999724
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),1024,ipcq,103.15999999999804
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),1024,raw,102.53999999999724
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),2048,ipcq,125.15999999999804
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),2048,raw,114.53999999999724
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),4096,ipcq,169.15999999999985
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),4096,raw,138.54000000000087
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),8192,ipcq,257.15999999999985
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),8192,raw,186.54000000000087
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),10240,ipcq,301.15999999999985
+latency_intercube_C0PE0_to_C1PE0_horizontal,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube1.PE0 (horizontal),10240,raw,210.54000000000087
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),128,ipcq,103.15999999999804
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),128,raw,111.28999999999724
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),256,ipcq,112.65999999999804
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),256,raw,119.53999999999724
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),384,ipcq,114.90999999999804
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),384,raw,120.53999999999724
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),512,ipcq,117.15999999999804
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),512,raw,121.53999999999724
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),768,ipcq,121.65999999999804
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),768,raw,123.53999999999724
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),1024,ipcq,127.15999999999804
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),1024,raw,126.53999999999724
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),2048,ipcq,149.15999999999804
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),2048,raw,138.53999999999724
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),4096,ipcq,193.15999999999985
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),4096,raw,162.54000000000087
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),8192,ipcq,281.15999999999985
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),8192,raw,210.54000000000087
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),10240,ipcq,325.15999999999985
+latency_intercube_C0PE0_to_C4PE0_vertical,Inter-cube PE-to-PE latency: Cube0.PE0 → Cube4.PE0 (vertical),10240,raw,234.54000000000087
@@ -0,0 +1,836 @@
+# KernBench — Architecture Design Document
+*2026 1H*
+
+KernBench is a system-level, discrete-event simulator for AI-accelerator
+chiplet systems. It models the data-movement and control paths across
+the full hardware hierarchy and reports end-to-end execution latency
+for kernels dispatched to the device's compute units.
+
+This document is a public summary of the architecture as designed and
+implemented in the first half of 2026. It assumes no prior knowledge of
+the simulator's internal documents; terms specific to the system are
+defined on first use.
+
+---
+
+## Design Principles
+
+KernBench is grounded in two foundational commitments: every measured
+latency must trace to explicit, modeled events on the simulator's graph,
+and every behavioral claim must be verifiable through tests that target
+spec-level invariants rather than incidental implementation details.
+
+<!-- src: ADR-0013 Context, Decision -->
+The testing posture is verification-driven. Tests are written to
+validate the architectural contracts that the simulator exposes —
+correct routing, deterministic results, monotonic latency under
+increasing hop counts — rather than to mirror the call graph of the
+implementation. Two phases coexist: a fast timing phase that exercises
+the simulator's discrete-event engine and produces a log of operations
+with timestamps, and an optional data-replay phase that uses that log
+to compute real numerical results. Tests can target either phase.
+
+<!-- src: ADR-0033 Context, Decision -->
+The latency model is intentionally abstract rather than
+cycle-accurate. Each modeled node contributes a configurable per-node
+overhead, each link contributes wire delay plus byte-over-bandwidth
+serialization, and each terminal service contributes its own service
+time. The simulator does not attempt to reproduce cache coherence
+protocols, microarchitectural pipelines, or full PCIe/UCIe protocol
+correctness; those are explicitly outside the scope. The aim is a
+simulator that compares system-level configurations meaningfully and
+deterministically, not one that reproduces microarchitectural detail.
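The three latency contributions described above can be sketched as a simple sum. This is a minimal illustration, not the simulator's code: `Hop`, `path_latency_ns`, and every parameter name are hypothetical, and the sketch assumes store-and-forward accumulation with no contention or pipelining between hops.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One node plus the link leaving it (hypothetical structure)."""
    node_overhead_ns: float        # configurable per-node overhead
    wire_delay_ns: float           # fixed propagation delay of the link
    bandwidth_bytes_per_ns: float  # link bandwidth (1 byte/ns == 1 GB/s)


def path_latency_ns(hops, size_bytes, service_time_ns):
    """Sum overhead, wire delay, and serialization per hop, then add terminal service."""
    total = 0.0
    for hop in hops:
        serialization = size_bytes / hop.bandwidth_bytes_per_ns
        total += hop.node_overhead_ns + hop.wire_delay_ns + serialization
    return total + service_time_ns


# Two identical hops carrying 128 bytes to a terminal with 10 ns service time.
hops = [Hop(2.0, 1.0, 64.0), Hop(2.0, 1.0, 64.0)]
latency = path_latency_ns(hops, size_bytes=128, service_time_ns=10.0)
```

Each hop here contributes 2 + 1 + 128/64 = 5 ns, so the two hops plus the terminal service time total 20 ns.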
+
+<!-- src: ADR-0033 Decision, Consequences -->
+Determinism is a hard requirement. Given identical inputs — topology,
+routing policy, and request stream — the simulator must produce
+identical outputs, hop traces included. This rules out reliance on
+unordered set iteration on the critical path and forces every latency
+contribution to come from an explicitly scheduled event on a modeled
+component or link. There are no implicit waits, no hardcoded magic
+delays, and no shortcuts that bypass the modeled graph.
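One common way to obtain this kind of determinism in a discrete-event loop is to break timestamp ties with a monotonically increasing sequence number, so that ordering never depends on hash or set iteration order. The sketch below assumes that approach; `EventQueue` and the event labels are hypothetical and are not taken from the simulator's source.

```python
import heapq
import itertools


class EventQueue:
    """Min-heap of (time, sequence, label); the sequence number breaks ties."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def schedule(self, time_ns, label):
        # Events with equal timestamps pop in the order they were scheduled.
        heapq.heappush(self._heap, (time_ns, next(self._seq), label))

    def run(self):
        trace = []
        while self._heap:
            time_ns, _, label = heapq.heappop(self._heap)
            trace.append((time_ns, label))
        return trace


def simulate():
    queue = EventQueue()
    queue.schedule(5.0, "router_b")
    queue.schedule(5.0, "router_a")
    queue.schedule(1.0, "pcie_ep")
    return queue.run()


# Identical inputs yield identical traces, including tie order.
first, second = simulate(), simulate()
```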
+
+---
+
+## High-level Architecture
+
+<!-- src: ADR-0003 Context, Decision -->
+The simulated system is a four-level hierarchy. A **Tray** holds one or
+more **SIPs** (system-in-package), each containing a 2D mesh of
+**CUBEs** plus one or more **IO chiplets** that connect the SIP to the
+host. Each CUBE contains a regular grid of **PEs** (processing
+elements) plus its own attached resources — high-bandwidth memory
+(HBM), a per-cube SRAM scratchpad, and a management CPU (M_CPU). The PE
+itself is a composite of nine sub-components rather than a monolithic
+core. This hierarchy is fixed; the parameters along each axis (counts,
+mesh dimensions, link widths) are configurable through the topology
+spec.
+
+<!-- src: ADR-0007 Context, Decision -->
+A clean separation runs along the request flow. A **runtime API** at
+the top is the host-facing surface; it exposes tensor and kernel
+operations, owns host-side allocation metadata, and is
+topology-agnostic: it does not route or fan out. Below it, the **simulation
+engine** decomposes runtime operations into discrete graph requests
+(memory writes, memory reads, kernel launches, MMU map installs) and
+schedules events deterministically. At the bottom, **components** model
+device behavior on a graph of nodes connected by links; they
+implement the actual latency contributions and pass requests along.
+No component reaches up into the runtime API, and no runtime call
+shortcuts the engine.
+
+<!-- DIAGRAM: Four-level system hierarchy — Tray containing SIPs, each SIP showing its 2D cube mesh and IO chiplet; one cube blown up to show the router mesh, attached PEs, M_CPU, SRAM, and HBM partition. -->
+
+### Tray
+
+<!-- src: ADR-0003 Decision -->
+The Tray is the outermost boundary. It owns the host CPU on one side
+and one or more SIPs on the other, connected through a fabric switch.
+For collective communication that must traverse multiple SIPs, the
+fabric switch acts as the common rendezvous: device-side outbound
+traffic from one SIP routes through the switch and back into the
+target SIP's IO chiplet.
+
+### SIP
+
+<!-- src: ADR-0003 Decision, ADR-0017 Context -->
+A SIP packages a 2D mesh of CUBEs and one or more IO chiplets. The
+default topology used by the simulator is a 4×4 cube mesh; the
+mesh dimensions are configurable. Each cube on the boundary of the
+mesh connects to its neighbors over UCIe (die-to-die) links arranged
+on the four cardinal sides — north, south, east, and west. The IO
+chiplets sit on one side of the SIP and provide the bridge to the host
+across PCIe.
+
+<!-- src: ADR-0016 Context, Decision -->
+The IO chiplet itself contains its own internal network. A
+host-facing PCIe endpoint passes traffic to a small NOC ("network on
+chip"); from there it can branch to a control-plane CPU that processes
+kernel-launch messages, or it can take the direct memory data path to
+the cube's HBM controller. The decision to provide a direct memory
+path that bypasses the control CPU was a deliberate concession to
+keep host-issued memory writes from paying control-plane overhead on
+the data path.
+
+### CUBE
+
+<!-- src: ADR-0017 Decision -->
+Each CUBE owns a 2D mesh of NOC routers and a set of attached
+resources: PEs, the cube-local SRAM scratchpad, the management CPU
+(M_CPU), and the HBM partition (split across multiple PE-private
+slices for bandwidth). The router mesh uses deterministic XY routing.
+Attached components do not connect to each other directly — they all
+sit on the router mesh, and every cube-internal transfer pays the
+mesh distance from source to destination.
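Deterministic XY routing can be illustrated with a short function. The sketch assumes the conventional reading of "XY" (travel fully along X first, then along Y) and an `(x, y)` coordinate convention; `xy_route` is a hypothetical name, not the simulator's router API.

```python
def xy_route(src, dst):
    """Return the router coordinates visited from src to dst, X first, then Y."""
    x, y = src
    path = [(x, y)]
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step_x
        path.append((x, y))
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step_y
        path.append((x, y))
    return path


# The same (src, dst) pair always yields the same path.
route = xy_route((0, 0), (2, 1))
hops = len(route) - 1
```

The hop count equals the Manhattan distance between the two routers, which is the "mesh distance" a cube-internal transfer pays.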
+
+<!-- src: ADR-0017 Decision -->
+The HBM partition is per-PE: each PE owns one HBM slice, and the
+controller exposes per-PE channels so that the same PE always
+addresses the same set of HBM channels. This makes the local-HBM
+bandwidth from a PE to its own slice predictable, while accesses to
+another PE's slice — or a different cube's slice — pay the mesh
+distance and any UCIe crossings.
+
+### PE
+
+<!-- src: ADR-0014 Context, Decision -->
+A PE is not a monolithic core. Internally it is a set of nine
+sub-components, each modeling one stage of a request's flow: a small
+control CPU, a tile-pipeline scheduler, a DMA engine, a fetch-store
+engine that moves data between the on-PE scratchpad and the register
+file, a GEMM compute engine, a math compute engine, the
+tightly-coupled memory (TCM, the on-PE scratchpad), an MMU for
+virtual-to-physical address translation, and an inter-PE collective queue
+(IPCQ). The scheduler decomposes higher-level operations into per-tile
+stage sequences, and tile tokens self-route from one sub-component
+to the next.
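The self-routing idea can be sketched as a token that carries its own stage sequence and advances a cursor after each stage. The stage names below follow the GEMM flow named in the diagram note for this section; `TileToken`, `advance`, and `schedule` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field

# Stage order for a GEMM tile, as listed in the PE diagram note.
GEMM_STAGES = ("DMA_READ", "FETCH", "GEMM", "STORE", "DMA_WRITE")


@dataclass
class TileToken:
    tile_id: int
    stages: tuple
    cursor: int = 0
    visited: list = field(default_factory=list)

    def advance(self):
        """Record the current stage and return the next one, or None when done."""
        self.visited.append(self.stages[self.cursor])
        self.cursor += 1
        return self.stages[self.cursor] if self.cursor < len(self.stages) else None


def schedule(num_tiles):
    """Decompose one operation into per-tile tokens (one token per tile)."""
    return [TileToken(tile_id=i, stages=GEMM_STAGES) for i in range(num_tiles)]


tokens = schedule(2)
for token in tokens:
    while token.advance() is not None:
        pass
```

Because each token knows its remaining stages, no central dispatcher has to decide where a tile goes next.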
+
+<!-- DIAGRAM: PE internal layout — the nine sub-components and the edges that connect them; tile token flowing through DMA_READ → FETCH → GEMM → STORE → DMA_WRITE. -->
+
+---
+
+## Detailed Architecture
+
+This section describes each modeled device-side component in turn.
+Components are listed in the alphabetical order used by the
+simulator's source tree.
+
+### forwarding
+
+<!-- src: ADR-0037 Context, Decision -->
+The forwarding component is the generic routing relay used wherever a
+node only needs to apply a small processing overhead and pass the
+request to the next hop. NOC routers, conn nodes, and UCIe PHYs all
+reduce to this. Its first act on receiving a request is to apply the
+per-node overhead configured for it in the topology spec; after the
+overhead it simply hands the request to the next hop along the path.
+
+<!-- src: ADR-0037 Decision, Consequences -->
+The decision to share one implementation across these roles was made
+to keep the simulator's component set small without sacrificing
+modeling fidelity. Each instance still carries its own overhead and
+its own link bandwidth contributions, so different roles still produce
+different timing. What is shared is the dispatcher loop, not the
+parameter values.
+
+### hbm_ctrl
+
+<!-- src: ADR-0034 Context, Decision -->
+The HBM controller is the terminal node for all memory traffic that
+reaches HBM. Internally it owns a number of pseudo channels, partitioned
+per-PE so that each PE addresses a deterministic subset. On request
+arrival, the controller first selects the right pseudo channel from the
+target address, then enters a chunk-loop that drains the requested
+size in fixed-size flits over the channel's bandwidth.
+
+<!-- src: ADR-0034 Decision, Consequences -->
+The chunk-loop pattern replaces an earlier all-at-once drain. The
+benefit is that the controller no longer presents a flit-aware fabric
+with a single bulk transfer; instead it emits flits at a paced rate
+matching the channel bandwidth, which makes cross-flow contention
+visible. The bandwidth budget is calibrated against the configured
+HBM total bandwidth divided across the channel count.
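+
+A minimal sketch of the chunk-loop pacing described above, assuming
+bandwidth is expressed in bytes per nanosecond. The function name,
+parameter names, and units are illustrative, not the simulator's API.
+
+```python
+from typing import Iterator, Tuple
+
+
+def drain_in_flits(size_bytes: int, flit_bytes: int,
+                   total_bw: float, num_channels: int,
+                   start_ns: float = 0.0) -> Iterator[Tuple[int, float]]:
+    """Yield (flit size, completion time in ns) for one request on one channel."""
+    channel_bw = total_bw / num_channels  # total HBM bandwidth split per channel
+    now, remaining = start_ns, size_bytes
+    while remaining > 0:
+        chunk = min(flit_bytes, remaining)  # the final flit may be short
+        now += chunk / channel_bw           # pace each flit at channel bandwidth
+        remaining -= chunk
+        yield chunk, now
+
+
+# 160 bytes in 64-byte flits over a 16 B/ns channel (64 B/ns across 4 channels).
+flits = list(drain_in_flits(160, 64, total_bw=64.0, num_channels=4))
+```
+
+Because each flit gets its own completion time, a model of this shape
+lets a second flow on the same channel interleave between flits
+instead of waiting behind one bulk transfer.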
+
+### io_cpu
+
+<!-- src: ADR-0036 Context, Decision -->
+The IO_CPU is the control-plane processor sitting inside the IO chiplet.
+It receives kernel-launch messages from the host, decodes them, and
+dispatches per-cube launches to the cube's management CPU. Pure memory
+operations bypass it entirely, taking the direct data path established
+inside the IO chiplet.
+
+<!-- src: ADR-0036 Decision -->
+On receiving a kernel-launch message, the IO_CPU consults the message's
+shard list — which already names the target SIP, cube, and PE for each
+piece of the tensor argument — and forwards a per-cube launch to each
+cube the kernel needs to reach. This makes the IO_CPU a deterministic
+fan-out point: it does not decode physical addresses to route, it just
+follows the explicit per-shard targets it was handed.
+
+### m_cpu
+
+<!-- src: ADR-0035 Context, Decision -->
+The M_CPU is the cube's management processor. It owns two distinct
+roles: as a control-plane fan-out point for kernel launches arriving
+from the IO chiplet, and as a DMA endpoint for host-initiated memory
+writes that need to land in this cube's HBM. The control role
+forwards launches to the right PE control CPUs; the DMA role places
+the actual bytes into HBM through the router mesh.
+
+<!-- src: ADR-0035 Decision -->
+The component model deliberately distinguishes the two roles because
+their routing differs: the control fan-out path uses command-kind
+links that do not appear on data-path routes, while the DMA path uses
+the same router mesh as PE-initiated DMA, with PE-internal nodes
+excluded. The routing layer knows about both modes and selects the
+appropriate adjacency at request time.
+
+### pcie_ep
+
+<!-- src: ADR-0038 Context, Decision -->
+The PCIE endpoint is the protocol boundary at the host-device edge.
+Its first act on each incoming request is to apply a configured
+protocol-processing overhead; after that it simply forwards. There is
+no internal queuing model, no retry, and no TLP-level fidelity — those
+are deliberately outside scope. The endpoint is bidirectional: host →
+device traffic (memory writes, kernel launches) flows one way, and
+device-side outbound traffic (cross-SIP collective sends) flows the
+other.
+
+<!-- src: ADR-0038 Decision, Alternatives Considered, Consequences -->
+A more detailed PCIe model was considered and rejected. The simulator
+is targeting system-level latency comparisons; making the endpoint
+heavier with credit-management and retry logic would not improve the
+metrics being studied. The decision keeps the endpoint as the
+documented protocol-boundary node, named consistently so routing
+helpers can locate it by SIP and IO instance.
+
+### pe_cpu
+
+<!-- src: ADR-0014 Decision -->
+The PE control CPU is the entry point for kernel work arriving from
+the cube's management CPU. It receives kernel-launch messages, resolves
+the kernel function by name, and hands execution to the scheduler with
+the resolved tensor arguments. From the scheduler's point of view, the
+PE_CPU is the upstream source of high-level commands; from the rest
+of the system's point of view, the PE_CPU is where a kernel's
+execution begins on a given PE.
+
+### pe_dma
+
+<!-- src: ADR-0014 Decision, ADR-0023 Decision -->
+The DMA engine on each PE has two distinct modes. In the standard PE
+pipeline it consumes tile tokens issued by the scheduler, acquires a
+read or write channel (modeled as a one-in-flight resource per
+direction), and runs the bytes to or from HBM through the mesh. In
+its collective mode it forwards send tokens for the PE's IPCQ into
+the fabric, snapshotting the source data at send time so later
+mutations cannot race the receiver's read. Both modes share the same
+channel resources but differ in their downstream handling — one
+returns when the round-trip completes, the other dispatches
+fire-and-forget.
+
+### pe_fetch_store
+
+<!-- src: ADR-0014 Decision -->
+The fetch-store engine is the bridge between the on-PE scratchpad
+(TCM) and the register file. It does not run DMA; it only moves bytes
+internally. On receiving a tile-stage token it sends a short request
+to the TCM, waits for the bandwidth-serialized delay, and continues
+the pipeline. The split between this engine and the TCM lets the
+scratchpad model its own read/write bandwidth independently.
+
+### pe_gemm
+
+<!-- src: ADR-0014 Decision -->
+The GEMM engine is the matrix-multiply compute unit. Tile tokens
+arriving at this stage carry the per-tile dimensions, and the engine
+contributes a service time that charges one fused multiply-add per MAC
+in the tile. Composite operations (where the same tensor pair is
+streamed across many tiles) reuse the engine through the scheduler;
+the engine itself is stateless between tiles.
+
+### pe_ipcq
+
+<!-- src: ADR-0023 Context, Decision -->
+The IPCQ (inter-PE collective queue) is each PE's
+collective-communication endpoint. It owns ring buffers that hold
+inbound messages from neighbor PEs and bookkeeping for send credits.
+Direction names ("N", "S", "E", "W" for cube-internal neighbors and
+"global_*" for cross-SIP neighbors) are resolved to physical peer
+endpoints by a neighbor table installed at process-group creation
+time. The component itself does not move bytes — it issues DMA tokens
+through the local PE_DMA, which performs the actual cross-PE
+transfer.
+
+<!-- src: ADR-0023 Decision, Consequences -->
+A key invariant is that the inbound terminal — where data lands at
+the receiver — pays the link bandwidth drain plus any cube-internal
+mesh hop to the slot's backing memory. This prevents IPCQ from
+silently outpacing raw DMA at large transfer sizes. Outbound sends
+are fire-and-forget; credit return is the only backpressure signal.
+
+### pe_math
+
+<!-- src: ADR-0014 Decision -->
+The math engine handles element-wise and reduction operations. It
+consumes tile tokens carrying an operation kind (`exp`, `sum`, `max`,
+`where`, etc.) and contributes a service time proportional to the
+number of elements processed. Like the GEMM engine it is stateless;
+chained epilogues (a sequence of math operations after a GEMM tile)
+are scheduled as separate stages.
+
+### pe_mmu
+
+<!-- src: ADR-0039 Context, Decision -->
+The MMU has two roles, exposed through one component. As a node on
+the cube NOC it receives MMU-map and MMU-unmap messages and updates
+its internal page table, so that the runtime API can install
+virtual-to-physical mappings with measured fabric latency. As a
+utility object held inside the PE it offers synchronous translate
+calls to the PE's DMA and GEMM engines without taking simulator time
+itself; the calling engine pays any configured TLB overhead in its
+own process.
+
+<!-- src: ADR-0039 Decision, Alternatives Considered -->
+The page table supports multiple disjoint regions inside a single
+page, with later-write-wins semantics on overlap. This is a deliberate
+simulator stopgap: it lets parallelization policies shard data at
+sub-page granularity, where a real hardware MMU's one-PA-per-entry
+assumption would silently mis-route accesses. A real MMU does not work
+this way; the model documents this as a simplification.
+
+### pe_scheduler
+
+<!-- src: ADR-0014 Decision -->
+The scheduler is the sole dispatcher inside a PE. Simple commands are
+routed directly to the right engine. Composite commands generate a
+tile plan, and the resulting tile tokens are fed into the pipeline.
+Self-routing keeps the scheduler off the per-stage hot path: each
+engine, on finishing a stage, advances the token to the next stage's
+component itself, so the scheduler only does initial dispatch and
+completion tracking.
+
+### pe_tcm
+
+<!-- src: ADR-0040 Context, Decision -->
+The TCM is the per-PE tightly-coupled scratchpad memory. It models
+time only, not data — the actual payload lives in the simulator's
+memory store. Read and write are independent channels: each is
+modeled as a one-in-flight resource, so same-direction requests
+serialize but a read and a write can overlap. The bandwidth of each
+direction is configured separately and applied as bytes-over-bandwidth
+on each request.
+
+<!-- src: ADR-0040 Decision, Alternatives Considered -->
+The decision to keep read and write on separate channels was made
+because the PE pipeline's normal case overlaps fetch (read) and store
+(write). Collapsing them into a single shared channel would have
+artificially serialized that overlap and produced an incorrect
+bandwidth ceiling.
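+
+The effect of the two independent channels can be sketched as follows.
+This is a timing-only toy under assumed units (bytes per nanosecond);
+the class and method names are hypothetical.
+
+```python
+class TwoChannelTcm:
+    """Timing-only scratchpad with one in-flight request per direction."""
+
+    def __init__(self, read_bw: float, write_bw: float) -> None:
+        self.bw = {"read": read_bw, "write": write_bw}  # bytes per ns
+        self.free_at = {"read": 0.0, "write": 0.0}      # when each channel idles
+
+    def access(self, direction: str, nbytes: int, now: float) -> float:
+        """Return the completion time of a request issued at `now`."""
+        start = max(now, self.free_at[direction])   # queue behind same direction only
+        done = start + nbytes / self.bw[direction]  # bytes-over-bandwidth
+        self.free_at[direction] = done
+        return done
+
+
+tcm = TwoChannelTcm(read_bw=32.0, write_bw=16.0)
+first_read = tcm.access("read", 64, now=0.0)    # runs immediately
+first_write = tcm.access("write", 64, now=0.0)  # overlaps the read
+second_read = tcm.access("read", 64, now=0.0)   # waits for first_read
+```
+
+With a single shared channel the write would have started only after
+the first read finished, which is the serialization the decision avoids.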
+
+### sram
+
+<!-- src: ADR-0041 Context, Decision -->
+The cube SRAM is a per-cube scratchpad attached to one of the cube's
+routers. As a node it applies a configured access overhead, pays the
+link-bandwidth drain stamped on the incoming request, and sends a
+response on the reverse path. It is a terminal — it does not forward.
+
+<!-- src: ADR-0041 Decision, Consequences -->
+A second role is as one of three backing-memory tiers (TCM, SRAM, HBM)
+that an inter-PE collective slot can live in. When the slot lives in
+SRAM, the PE_DMA pays the slot read or write latency directly using
+the configured SRAM bandwidth and overhead; the SRAM component does
+not need to know about collective semantics. This separation keeps
+the SRAM component agnostic to the collective subsystem.
+
+### tiling
+
+<!-- src: ADR-0042 Context, Decision -->
+The tile-plan generator is not a runtime component — it is a pure
+module of functions that take a problem shape (matrix dimensions, tile
+sizes) and produce an ordered list of tile-stage sequences. The
+scheduler consumes this list. Each tile's stage sequence depends on
+how its operands are staged: operands streamed from HBM produce
+DMA_READ stages, operands already resident in TCM (because they were
+loaded eagerly upfront) skip them.
+
+<!-- src: ADR-0042 Decision, Consequences -->
+The plan generator is intentionally pure — given the same input it
+returns the same plan, with no simulator events created. This lets
+the rest of the system reason about tile sequences as data, and it
+makes the plan testable in isolation without simulator state. New
+plan variants (for example, K-major or DTensor-aware plans) can be
+added as new functions following the same shape.
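+
+A pure plan generator of this kind might look like the sketch below.
+The stage names follow the PE pipeline order (DMA_READ, FETCH, GEMM,
+STORE, DMA_WRITE); the function name, signature, and return shape are
+assumptions made for illustration.
+
+```python
+from typing import List, Tuple
+
+TilePlan = List[Tuple[Tuple[int, int], List[str]]]
+
+
+def plan_tiles(m: int, n: int, tile_m: int, tile_n: int,
+               operands_in_tcm: bool = False) -> TilePlan:
+    """Return ((row, col), stage sequence) pairs in row-major tile order."""
+    stages = ["FETCH", "GEMM", "STORE", "DMA_WRITE"]
+    if not operands_in_tcm:
+        stages = ["DMA_READ"] + stages  # streamed operands are read from HBM first
+    return [((row, col), list(stages))
+            for row in range(0, m, tile_m)
+            for col in range(0, n, tile_n)]
+
+
+streamed = plan_tiles(4, 4, 2, 2)
+resident = plan_tiles(4, 4, 2, 2, operands_in_tcm=True)
+```
+
+Since the function only builds a list, a test can compare two plans
+directly without constructing any simulator state.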
+
+---
+
+## Implementation Decisions
+
+This section collects cross-cutting decisions — algorithms, policies,
+schemes, and contracts — that span multiple components rather than
+living inside one.
+
+### Address Scheme
+
+<!-- src: ADR-0001 Context, Decision -->
+Every physical address in the simulator decodes into a structured
+location. A fixed-width physical address carries the SIP id, the
+cube id within the SIP, a type discriminator (HBM vs PE-resource vs
+others), and a type-specific offset. HBM addresses additionally encode
+the per-PE slice offset so the controller can determine which PE
+owns the target slice without external lookup. The layout is
+deliberately reserved rather than packed-to-fit, so new sub-units can
+be added at the type-discriminator level without rewriting existing
+addresses.
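+
+The decode step can be illustrated with a small bit-field sketch. The
+field widths below are placeholders chosen for the example; this
+document does not state the real widths, and `Location`, `encode`, and
+`decode` are hypothetical names.
+
+```python
+from dataclasses import dataclass
+
+# Assumed widths, for illustration only.
+SIP_BITS, CUBE_BITS, TYPE_BITS, OFFSET_BITS = 4, 4, 4, 36
+
+
+@dataclass(frozen=True)
+class Location:
+    sip: int
+    cube: int
+    kind: int    # type discriminator, e.g. HBM vs PE resource
+    offset: int  # type-specific offset
+
+
+def encode(loc: Location) -> int:
+    addr = loc.sip
+    addr = (addr << CUBE_BITS) | loc.cube
+    addr = (addr << TYPE_BITS) | loc.kind
+    return (addr << OFFSET_BITS) | loc.offset
+
+
+def decode(addr: int) -> Location:
+    offset = addr & ((1 << OFFSET_BITS) - 1)
+    addr >>= OFFSET_BITS
+    kind = addr & ((1 << TYPE_BITS) - 1)
+    addr >>= TYPE_BITS
+    cube = addr & ((1 << CUBE_BITS) - 1)
+    sip = addr >> CUBE_BITS
+    return Location(sip, cube, kind, offset)
+
+
+loc = Location(sip=2, cube=5, kind=1, offset=0x1000)
+roundtrip = decode(encode(loc))
+```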
+
+<!-- src: ADR-0011 Context, Decision -->
+On top of physical addressing, the simulator supports three address
+models that the runtime API selects between. Direct physical
+addressing is retained as a fallback. Virtual addressing — the
+current default — gives each tensor a contiguous virtual range at
+deployment, with the per-PE MMU translating per access; an
+alternative logical-address scheme remains a future option. The
+virtual-address path is what every modern test path takes; the PA
+fallback is used by the MMU itself when no mapping exists for an
+address (a deliberate signal, not an error).
+
+<!-- src: ADR-0011 Decision, Consequences -->
+Tensor placement is represented as a list of physical-address shards,
+each tagged with target SIP, cube, and PE, plus a single tensor-wide
+virtual base. This means a kernel sees one virtual base for the whole
+tensor while the host driver and the engine still know exactly where
+each shard lives. Replicated tensors get per-cube local PA mappings;
+sharded tensors broadcast their mapping across cubes within a SIP.
+
+### Routing, Distance & Helper API
+
+<!-- src: ADR-0002 Context, Decision -->
+Routing is policy-driven, deterministic, and topology-aware. Given a
+source, a destination, and an intent — for example, PE-initiated
+DMA versus host-initiated memory write versus a generic
+component-to-component query — the routing layer picks the right
+path. The intent matters because different traffic types must avoid
+different categories of edges: PE-initiated DMA should not traverse
+command-only links; M_CPU DMA should not pass through PE-internal
+pipeline edges; cube-local transfers should not use the
+zero-distance UCIe bus that would otherwise look attractive to a
+shortest-path search.
+
+<!-- src: ADR-0051 Decision -->
+The routing layer therefore maintains four separate adjacency graphs
+at construction, each excluding a different category of edges, and
+picks the appropriate one per intent. On top of the graphs sits a
+helper API that hides the topology's naming convention: callers ask
+for the PCIe endpoint of a given SIP, the M_CPU of a given cube, or
+the HBM destination for a given physical address, and receive the
+corresponding node id. No component constructs node-id strings
+directly; if the naming convention ever changes, the change is local
+to the helper layer.
+
+<!-- src: ADR-0051 Decision, Consequences -->
+Path-finding itself uses Dijkstra with explicit per-edge weights
+(routing weight is allowed to differ from physical distance — for
+example, UCIe is configured to be routing-preferable). Tie-breaks
+follow insertion order, which keeps results deterministic. Paths
+between unreachable nodes raise rather than returning empty, surfacing
+topology errors immediately.
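+
+A minimal sketch of deterministic path-finding in this style. It
+assumes the adjacency for the chosen intent is a plain dictionary
+whose neighbor lists preserve insertion order; the function name and
+the exception type are illustrative.
+
+```python
+import heapq
+from typing import Dict, List, Tuple
+
+Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(neighbor, routing weight)]
+
+
+def shortest_path(graph: Graph, src: str, dst: str) -> List[str]:
+    seq = 0  # push order; breaks cost ties in favor of the earlier edge
+    heap = [(0.0, seq, src, [src])]
+    settled = set()
+    while heap:
+        cost, _, node, path = heapq.heappop(heap)
+        if node == dst:
+            return path
+        if node in settled:
+            continue
+        settled.add(node)
+        for neighbor, weight in graph.get(node, []):
+            if neighbor not in settled:
+                seq += 1
+                heapq.heappush(heap, (cost + weight, seq, neighbor, path + [neighbor]))
+    raise ValueError(f"no route from {src!r} to {dst!r}")  # never return empty
+
+
+mesh = {"a": [("b", 1.0), ("c", 1.0)], "b": [("d", 1.0)],
+        "c": [("d", 1.0)], "d": [], "island": []}
+path = shortest_path(mesh, "a", "d")  # two equal-cost routes; "b" was inserted first
+```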
+
+### Memory Semantics and Local-HBM Bandwidth
+
+<!-- src: ADR-0004 Context, Decision -->
+A PE accessing its own HBM slice through its own cube's NOC must see
+the full local HBM bandwidth — that is the model's intent. Memory
+traffic accumulates latency from per-component overhead and
+bytes-over-link-bandwidth serialization along the path, but the
+controller does not throttle below the slice's allotted bandwidth.
+Cross-PE-slice accesses inside the same cube, cross-cube accesses
+through UCIe, and cross-SIP accesses through PCIe each pay
+progressively more overhead as the path grows.
+
+### Topology Compilation, Diagrams & Builder Algorithms
+
+<!-- src: ADR-0006 Context, Decision -->
+Topology is configurable, not hardcoded. The simulator reads a YAML
+spec, compiles it into a flat graph of nodes and edges plus four
+view projections at different abstraction levels — system, SIP, cube,
+PE — and uses the compiled graph as the single source for both
+execution and visualization. Distance metadata used by routing is
+extracted at compile time so that diagrams and routing decisions
+agree by construction.
+
+<!-- src: ADR-0005 Context, Decision -->
+Diagrams are derived artifacts of the compiled topology. The visualizer
+produces one SVG per view at the appropriate abstraction level; nothing
+in the diagrams is hand-drawn or hand-positioned. Distance-aware
+layout rules place nodes in the diagrams using the same coordinates
+that routing uses to compute distance, so a diagram that "looks
+wrong" is a signal that the topology itself has a problem, not the
+visualizer.
+
+<!-- src: ADR-0053 Decision -->
+Inside a cube the router mesh is generated automatically. PE corner
+positions are fixed by convention; the relay-column algorithm
+inserts additional grid columns whenever the gap between adjacent PE
+columns would exceed a tunable maximum. HBM occupies a central
+exclusion zone — router slots inside the zone are deliberately empty,
+since HBM controllers attach as separate named nodes. M_CPU and SRAM
+attach to the nearest router by Euclidean distance from their
+configured placement coordinates, and UCIe physical lanes distribute
+along the boundary rows and columns. The whole mesh is cached
+beside the topology spec and invalidated only when one of a small set
+of layout-relevant fields changes.
+
+<!-- DIAGRAM: One cube's router mesh — rows × cols of routers with HBM exclusion zone in the middle, PEs/M_CPU/SRAM attaching to nearest routers, UCIe PHYs along the perimeter. -->
+
+### Tensor Deployment and Allocation
+
+<!-- src: ADR-0008 Context, Decision -->
+Tensor deployment in the runtime API produces a list of physical-address
+shards plus a single tensor-wide virtual base. The host allocator
+walks the data-parallelism policy, computes per-shard placement, and
+emits the per-shard physical addresses through the per-PE allocators.
+No separate "allocate then later attach to a device" RPC exists —
+allocation and deployment are a single operation that produces a
+deployed tensor handle.
+
+### Memory Allocator Algorithms
+
+<!-- src: ADR-0048 Context, Decision -->
+Each per-PE allocator owns two channels — HBM slice and TCM — each
+backed by an offset-keyed free-list. Allocation is first-fit; freeing
+coalesces with adjacent free blocks. A device-wide virtual allocator
+sits above the per-PE allocators, aligns requests up to the configured
+page size, and coalesces on free in the same way. The trade-off is
+explicit: first-fit is simpler and cheaper than best-fit or buddy
+allocation, and the simulator's workload is stack-like enough
+(deploy / kernel / free in matched order) that fragmentation is not
+a practical concern.
+
+<!-- src: ADR-0048 Decision, Consequences -->
+Allocation failure raises rather than silently returning a partial
+result. A partial tensor reaching the engine would route over wrong
+PAs and silently corrupt simulator output, so an out-of-memory signal
+is preferred. The free path trusts its caller to pass back exactly
+what was allocated; the small risk of caller error in exchange for
+fast common-case freeing is documented as a deliberate trade.
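+
+The free-list behavior can be sketched in a few lines. This is a toy
+under the assumptions stated in the comments, not the simulator's
+allocator; `FirstFitAllocator` and its method names are hypothetical.
+
+```python
+class FirstFitAllocator:
+    """Offset-keyed free-list with first-fit allocation and coalescing."""
+
+    def __init__(self, size: int) -> None:
+        self.free = {0: size}  # offset -> length of each free block
+
+    def alloc(self, nbytes: int) -> int:
+        for offset in sorted(self.free):  # lowest-offset block that fits wins
+            length = self.free[offset]
+            if length >= nbytes:
+                del self.free[offset]
+                if length > nbytes:       # keep the unused tail on the list
+                    self.free[offset + nbytes] = length - nbytes
+                return offset
+        raise MemoryError(f"cannot allocate {nbytes} bytes")  # no partial result
+
+    def release(self, offset: int, nbytes: int) -> None:
+        # The caller is trusted to pass back exactly what alloc() returned.
+        following = offset + nbytes
+        if following in self.free:        # merge with the block after
+            nbytes += self.free.pop(following)
+        for prev, length in self.free.items():
+            if prev + length == offset:   # merge with the block before
+                offset, nbytes = prev, length + nbytes
+                break
+        self.free[offset] = nbytes
+
+
+heap = FirstFitAllocator(64)
+a, b, c = heap.alloc(16), heap.alloc(16), heap.alloc(32)
+for off, size in ((a, 16), (c, 32), (b, 16)):
+    heap.release(off, size)
+```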
+
+### Kernel Execution and Host-Device Messaging
+
+<!-- src: ADR-0009 Context, Decision -->
+Kernel execution decomposes into a small set of messages that travel
+the device graph. The host issues a single kernel-launch message; the
+IO_CPU fans it out per-cube; the cube M_CPU fans it out per-PE; the
+PE CPU resolves the kernel and runs it through the scheduler.
+Completion flows back the same way, gated by per-shard completion
+tracking. Memory operations follow the same pattern: a memory write
+or read travels as one message that the engine routes to the right
+HBM controller, with a response taking the reverse path.
+
+<!-- src: ADR-0012 Context, Decision -->
+The schema between the host and the device-side IO CPU is PA-first
+and shard-tagged. Every byte of host-issued payload arrives with an
+explicit target SIP, cube, PE, and physical address. The IO_CPU does
+not decode addresses to derive placement — placement is named
+explicitly by the shard list. This makes the host-device interface
+deterministic and keeps the routing helper free of host-derived
+intent.
+
+### CLI Surface and Semantics
+
+<!-- src: ADR-0010 Context, Decision -->
+The command-line interface exposes four subcommands. A bench runner
+loads a topology, resolves a registered benchmark by name or index,
+and runs it on a selected device. A bench-listing command enumerates
+the registered benchmarks. A probe utility runs a fixed catalog of
+traffic patterns through the engine for latency and bandwidth
+verification. A web viewer renders the topology in a browser. A
+benchmark instance is always single-device by convention; multi-SIP
+collective work happens inside the benchmark through the launcher
+abstraction, not by multiplexing the CLI.
+
+### Component Port and Wire Fabric Model
+
+<!-- src: ADR-0015 Context, Decision -->
+Every modeled component exposes input and output ports, and every
+edge in the topology connects an output port on one component to an
+input port on another. Bandwidth and propagation delay are properties
+of the wire between ports, not of the component endpoints. A
+component's responsibility is to apply its configured per-node
+overhead and either forward to the next hop or terminate; the wire
+charges the byte-over-bandwidth serialization separately.
+
+<!-- src: ADR-0015 Decision, Consequences -->
+This separation lets components be swapped behind their port
+interface without changing the rest of the model, and it keeps
+bandwidth contention at the wire level where multiple components may
+contend for the same edge. Future component models can refine
+internal behavior without disturbing the fabric.
+
+### Two-Pass Data Execution
+
+<!-- src: ADR-0020 Context, Decision -->
+The simulator runs in two passes. The first pass (fast and always
+on) runs the discrete-event engine and records every data operation
+in an operation log with timestamps, component identifiers, and
+per-operation parameters. The second pass (optional, opt-in) replays
+the log against an in-memory tensor store to produce actual numerical
+results. Tests that only need timing skip the second pass; tests that
+need to verify correctness opt in.
+
+<!-- src: ADR-0020 Decision, Consequences -->
+The split lets the timing engine remain unconcerned with data
+semantics: kernels move handles around, not bytes. The replay phase
+recovers data semantics from the recorded operations, in their
+original time order with a small set of secondary-sort rules. The
+op-log records carry enough metadata — input snapshots for compute
+operations, source snapshots for cross-component copies — that the
+replay phase cannot mis-order with respect to in-flight mutations.
+
+### Sim-engine Op Log and Memory Store Schemas
+
+<!-- src: ADR-0052 Context, Decision -->
+The operation log holds typed records with seven fields each: start
+and end timestamps, the component that issued the operation, an
+operation kind ("memory", "gemm", "math"), an operation name, a
+parameter dictionary, and a (currently unused) dependency list.
+Records are kept in stable timestamp order. The parameter dictionary
+varies by operation: a DMA read carries source address and byte count;
+a GEMM carries operand shapes, dtypes, and address spaces; a math
+operation carries input addresses and snapshots.
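+
+One way to picture the seven-field record is as a dataclass. The field
+names, component names, and sort key below are hypothetical; only the
+field count, the meaning of each field, and the three operation kinds
+come from the description above.
+
+```python
+from dataclasses import dataclass, field, fields
+from typing import Any, Dict, List
+
+
+@dataclass
+class OpRecord:
+    t_start: float
+    t_end: float
+    component: str   # component that issued the operation
+    op_kind: str     # "memory", "gemm", or "math"
+    op_name: str
+    params: Dict[str, Any] = field(default_factory=dict)
+    deps: List[int] = field(default_factory=list)  # currently unused
+
+
+def in_time_order(records: List[OpRecord]) -> List[OpRecord]:
+    # sorted() is stable, so records with equal start times keep log order.
+    return sorted(records, key=lambda r: r.t_start)
+
+
+log = [
+    OpRecord(5.0, 9.0, "pe0.gemm", "gemm", "dot"),
+    OpRecord(0.0, 4.0, "pe0.dma", "memory", "dma_read",
+             {"src_addr": 0x1000, "nbytes": 256}),
+    OpRecord(5.0, 6.0, "pe0.math", "math", "exp"),
+]
+ordered = in_time_order(log)
+```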
+
+<!-- src: ADR-0052 Decision, Consequences -->
+The companion memory store is a two-level dictionary keyed by
+address space ("hbm", "tcm", "sram", others) and integer address.
+Reads and writes are reference-based — no copy by default — so
+callers wanting to detach a snapshot must copy explicitly. This is
+deliberate: the engine-internal snapshot paths copy at well-defined
+points (math input capture, HBM source capture for DMA writes,
+inbound collective copies) and downstream replay code therefore
+sees stable data even when slot or scratch addresses are reused by
+later operations.
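+
+The reference-based contract is easy to trip over, so a small sketch
+may help. `MemoryStore` here is a stand-in with assumed method names;
+it only demonstrates the no-copy-by-default behavior.
+
+```python
+import copy
+from collections import defaultdict
+
+
+class MemoryStore:
+    """Two-level dictionary: address space -> integer address -> payload."""
+
+    def __init__(self) -> None:
+        self._spaces = defaultdict(dict)
+
+    def write(self, space: str, addr: int, value) -> None:
+        self._spaces[space][addr] = value  # stores the reference, not a copy
+
+    def read(self, space: str, addr: int):
+        return self._spaces[space][addr]   # returns the same object
+
+
+store = MemoryStore()
+payload = [1, 2, 3]
+store.write("hbm", 0x1000, payload)
+snapshot = copy.deepcopy(store.read("hbm", 0x1000))  # explicit detach
+payload[0] = 99                                       # later mutation
+```
+
+After the mutation the store sees the new value, while the detached
+snapshot still holds the original data.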
+
+### 2D Grid Program Identity
+
+<!-- src: ADR-0022 Context, Decision -->
+Inside a kernel the program identity is two-dimensional. The
+first axis corresponds to the PE index within a cube; the second
+corresponds to the cube index within a SIP. Together they let a
+kernel address its position both within its cube and within the
+larger system without needing to know the full topology. Total
+program counts along each axis are exposed symmetrically.
+
+### Parallelism — SIP Launcher, DPPolicy, Megatron-TP, AHBM Backend, and CCL Algorithm Module
+
+<!-- src: ADR-0024 Context, Decision -->
+The launcher model treats each SIP as one rank. Inside a process the
+launcher spawns one greenlet per SIP rank; the rank is bound to its
+greenlet so that any code running in that worker sees the right
+distributed-style rank. This is a deliberately PyTorch-compatible
+shape: a benchmark looks like a small DDP training script — initialize
+a process group, spawn workers, each worker runs the same body.
+
+<!-- src: ADR-0026 Context, Decision -->
+Data-parallelism policy lives in a single object that names the
+sharding strategy along the cube axis (replicate, row-wise,
+column-wise) and along the PE axis (same set of values), and optionally
+overrides the number of cubes or PEs participating. The policy is
+intra-device — it does not cross SIP boundaries. SIP-level parallelism
+is the launcher's responsibility, and the two axes compose
+orthogonally.
+
+<!-- src: ADR-0027 Context, Decision -->
+A Megatron-style tensor-parallel API sits on top of the launcher and
+the DP policy. Layer-level building blocks — column-parallel linear,
+row-parallel linear, all-reduce — name their sharding intent in terms
+the launcher and the placement policy can compose. This is the layer
+that bench code typically writes against.
+
+<!-- src: ADR-0047 Context, Decision -->
+For collective operations the runtime exposes a PyTorch-compatible
+distributed backend named "ahbm". On process-group initialization the
+backend reads the collective configuration, resolves the world size
+(priority: explicit ccl.yaml override → defaults section → topology
+SIP count), imports the configured algorithm module dynamically,
+derives the SIP topology kind, and pushes the inter-PE
+neighbor table to every participating PE. From that point on, an
+all-reduce call dispatches the algorithm's kernel function across
+all ranks.
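+
+The world-size priority chain can be written as a short fallback
+function. The dictionary keys (`world_size`, `defaults`) are assumed
+for the example and may not match the real ccl.yaml layout.
+
+```python
+from typing import Any, Dict
+
+
+def resolve_world_size(ccl_cfg: Dict[str, Any], topology_sip_count: int) -> int:
+    """Explicit override, then the defaults section, then the topology."""
+    explicit = ccl_cfg.get("world_size")
+    if explicit is not None:
+        return int(explicit)
+    default = ccl_cfg.get("defaults", {}).get("world_size")
+    if default is not None:
+        return int(default)
+    return topology_sip_count
+
+
+from_override = resolve_world_size({"world_size": 2, "defaults": {"world_size": 4}}, 6)
+from_defaults = resolve_world_size({"defaults": {"world_size": 4}}, 6)
+from_topology = resolve_world_size({}, 6)
+```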
+
+<!-- src: ADR-0050 Context, Decision -->
+A collective-algorithm module is a Python module with a small, fixed
+contract. It exposes topology-kind integer constants, a name-to-kind
+mapping for the YAML configuration, a kernel-arguments builder, and
+a kernel function — the kernel function being aliased to the name
+`kernel` so the backend can find it generically. The kernel itself
+takes the tensor pointer, the per-cube element count, cube mesh
+width and height, the world size, the current rank, and the SIP
+topology dimensions; the backend appends those last four arguments
+automatically. New collectives slot in by adding a new module that
+follows this shape.
+
+<!-- src: ADR-0027 Decision, Consequences -->
+The combination is deliberate: bench authors get to write code that
+looks like a regular distributed training script, while the launcher,
+backend, and placement policies behind it remain free to redirect
+work to the right SIP, cube, and PE without exposing topology to the
+kernel.
+
+### IPCQ Direction Addressing
+
+<!-- src: ADR-0025 Context, Decision -->
+Inside a collective algorithm, peer PEs are named by direction —
+"N", "S", "E", "W" for cube-internal neighbors, and "global_*" for
+cross-SIP neighbors. Direction addressing is the addressing scheme:
+the algorithm names a direction, the IPCQ neighbor table installed
+at process-group time resolves the direction to the peer endpoint's
+physical-address coordinates, and the PE_DMA performs the actual
+transfer. The algorithm itself does not see PA arithmetic — direction
+is the user-facing handle.
+
+### Intercube All-Reduce
+
+<!-- src: ADR-0032 Context, Decision -->
+The default all-reduce algorithm uses a center-rooted bidirectional
+reduce phase inside each SIP's cube mesh, followed by an inter-SIP exchange
+on the mesh's root cube, and then a bidirectional broadcast back
+out. Center-rooting halves the in-cube hop count compared with a
+corner-rooted walk. The inter-SIP exchange itself follows the
+configured SIP topology — ring, torus, or non-wrapping mesh —
+selected at runtime through the SIP-topology kind integer the
+backend passes to the kernel.
+
+### Evaluation Harnesses
+
+<!-- src: ADR-0043 Context, Decision -->
+The all-reduce evaluation harness drives correctness and the
+latency/buffer-kind sweeps through the public distributed path —
+initialize process group, spawn workers, call all-reduce — rather
+than the lower-level engine interface. A shared helper module factors
+out the setup; sweep tests cover the buffer-kind tiers (TCM, SRAM,
+HBM) and the inter-SIP topology variants. The plots produced by the
+harness are part of its output contract; the harness regenerates them
+on demand.
+
+<!-- src: ADR-0044 Context, Decision -->
+The GEMM evaluation harness is split into two layers. A heavy
+shape-and-variant sweep lives as a manual script — it runs the same
+composite-GEMM benchmark across many shapes and operand-staging
+variants, harvests the resulting op-log, and writes a JSON summary.
+A faster figure-generation layer lives in the test suite and consumes
+that JSON to render plots. The split keeps the heavy data
+generation explicit and out of the regular test path.
+
+### Bench Module Contract
+
+<!-- src: ADR-0045 Context, Decision -->
+Adding a new benchmark requires only dropping a file into the
+benchmarks directory. The file registers one or more benchmark
+functions through a small decorator that takes a kebab-case name and
+a human-readable description. The decorator is the registration
+mechanism — there is no separate manifest. Each benchmark function
+takes one argument, conventionally named `torch`, which is the
+runtime context exposing tensor allocation, kernel launch,
+distributed APIs, and process-spawning. The function name is `run` by
+convention.
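+
+A decorator-based registry of this shape could be sketched as below.
+`register_bench` and `_REGISTRY` are hypothetical names; the real
+decorator's name and signature are not given in this document.
+
+```python
+from typing import Callable, Dict, Tuple
+
+_REGISTRY: Dict[str, Tuple[str, Callable]] = {}
+
+
+def register_bench(name: str, description: str):
+    """Register the decorated function under a kebab-case name."""
+    def wrap(fn: Callable) -> Callable:
+        if name in _REGISTRY:
+            raise ValueError(f"benchmark {name!r} is already registered")
+        _REGISTRY[name] = (description, fn)
+        return fn  # the function itself is left unchanged
+    return wrap
+
+
+@register_bench("vector-add", "Element-wise add on a single device")
+def run(torch):
+    # `torch` is the runtime context handed in by the runner; a real
+    # benchmark would allocate tensors and launch a kernel through it.
+    return None
+```
+
+In a scheme like this, importing the file is enough to register the
+benchmark, which is why no separate manifest is needed.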
+
+<!-- src: ADR-0045 Decision, Consequences -->
+A benchmark must submit at least one operation, or the runner
+returns an error. A benchmark instance is single-device by default;
+when a benchmark is collective, it uses the distributed-process-spawn
+pattern internally — one worker greenlet per rank, with each worker
+binding to its rank. Multi-device benchmark patterns outside that
+shape are not supported.
+
+### Kernel-side `tl.*` API
+
+<!-- src: ADR-0046 Context, Decision -->
+Inside a kernel function, the `tl` argument exposes the kernel-side
+API in a shape that mirrors the conventions of established
+GPU-kernel languages. Categories: reference handles that name HBM
+data without issuing DMA; data movement (load, store) that does
+issue DMA; GEMM and math compute (dot, composite, the unary and
+binary math operations, reductions); index and scalar helpers
+(program identity, range-builders); metadata-only operations like
+transpose; and the collective primitives (send, receive,
+non-blocking receive). Tensor handles support arithmetic operators
+via a thread-local active context so kernel code reads naturally.
+
+<!-- src: ADR-0046 Decision, Consequences -->
+The API supports two execution modes. A command-list mode records
+operations into a list without consuming simulator time — useful for
+inspection and lightweight tests. A greenlet-driven mode runs the
+kernel as a child greenlet that switches back to the simulator on
+each `tl.*` call; the simulator drives the event scheduler and hands
+real data back to the kernel as DMA reads complete. The two modes
+share the same surface; the kernel does not know which one it is
+running under.
+
+### Probe Subcommand
+
+<!-- src: ADR-0049 Context, Decision -->
+The probe utility runs three families of traffic patterns through
+the engine — host-to-device writes at increasing hop counts,
+device-to-host reads at increasing hop counts, and PE-initiated DMA
+across the cube mesh — and reports actual latency, the analytical
+formula breakdown, effective bandwidth, bottleneck bandwidth, and
+utilization. A fixed reference size is used for the summary table;
+a separate utilization-versus-size sweep covers a logarithmic range
+of transfer sizes. Each case runs in its own engine instance so
+cases do not perturb each other.
+
+<!-- src: ADR-0049 Decision, Consequences -->
+The probe also checks a small set of invariants automatically:
+latency increases monotonically with hop count, device-to-host latency
+is at least as large as host-to-device latency for the same hop count,
+and the best-case cross-cube PE DMA path is faster than the worst-case
+path. Failures print prominently. The output is meant for human
+reading; automated parsing should not depend on column widths or whitespace.
+
+---
+
+This document summarizes 46 architecture decisions captured during
+the first half of 2026. It is regenerated mechanically from the
+decision corpus; sources are recorded in HTML comments throughout.
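Note on the Probe Subcommand section of this file: the three invariants it lists can be sketched as a small checker. This is only an illustration under assumptions. The function name, the argument layout (latency keyed by hop count), and the sample numbers are hypothetical and are not taken from the probe implementation, and "monotonic" is read here as strictly increasing.

```python
from typing import Dict, List


def check_probe_invariants(
    h2d_ns: Dict[int, float],   # hypothetical: host-to-device latency per hop count
    d2h_ns: Dict[int, float],   # hypothetical: device-to-host latency per hop count
    pe_best_ns: float,          # hypothetical: best-case cross-cube PE DMA latency
    pe_worst_ns: float,         # hypothetical: worst-case cross-cube PE DMA latency
) -> List[str]:
    """Return a list of human-readable invariant failures (empty if all hold)."""
    failures: List[str] = []

    # Invariant 1: latency increases with hop count (assumed strict).
    for name, series in (("h2d", h2d_ns), ("d2h", d2h_ns)):
        hops = sorted(series)
        for a, b in zip(hops, hops[1:]):
            if not series[a] < series[b]:
                failures.append(f"{name}: latency did not increase from {a} to {b} hops")

    # Invariant 2: device-to-host is at least as slow as host-to-device per hop count.
    for hop in sorted(set(h2d_ns) & set(d2h_ns)):
        if d2h_ns[hop] < h2d_ns[hop]:
            failures.append(f"d2h faster than h2d at {hop} hops")

    # Invariant 3: the best-case PE DMA path beats the worst-case path.
    if not pe_best_ns < pe_worst_ns:
        failures.append("PE DMA best-case path is not faster than worst-case")

    return failures


# Made-up sample numbers, for illustration only.
good = check_probe_invariants({1: 100.0, 2: 150.0}, {1: 120.0, 2: 170.0}, 80.0, 200.0)
bad = check_probe_invariants({1: 100.0, 2: 90.0}, {1: 120.0, 2: 170.0}, 80.0, 200.0)
```

Returning the failures as a list, rather than raising on the first one, fits the document's statement that failures are printed prominently for a human reader.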
@@ -12,8 +12,8 @@ dependencies = ["pytest", "simpy", "pyyaml", "fastapi>=0.110", "uvicorn[standard
 kernbench = "kernbench.cli.main:main"

 [tool.setuptools.packages.find]
-where = ["src", "."]
-include = ["kernbench*", "benches*"]
+where = ["src"]
+include = ["kernbench*"]

 [project.optional-dependencies]
 dev = [
@@ -4,8 +4,8 @@ Slides:
  1. Overall architecture — how PEs are connected (cube_mesh_view)
  2. Model correctness — DMA vs P2P latency (pe2pe overview)
  3. PE-to-PE IPCQ communication (ipcq_two_pe_dma)
-  4. 6-device allreduce — model vs theoretical vs ext-sim (overview_broken)
-  5. IPCQ buffer-kind sweep — TCM vs SRAM vs HBM (buffer_kind_sweep)
+  4. 6-device allreduce — model vs theoretical vs FSIM (comparison_…_fsim)
+  5. IPCQ buffer-kind sweep — TCM vs SRAM vs HBM (…_with_TCM_SRAM_HBM)
  6. PE_accelerator data path (composite GEMM pipeline structure)
  7. matmul(32, 128, 32) — composite GEMM execution sequence
  8. matmul(32, 128, 128) — pipeline scaling and HBM contention
@@ -63,7 +63,7 @@ SLIDES = [
    },
    {
        "title": "4. 6-Device Allreduce: Model vs Theoretical vs External Simulator",
-        "image": DIAG / "allreduce_latency_plots" / "overview_broken.png",
+        "image": DIAG / "allreduce_latency_plots" / "comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png",
        "bullets": [
            "Three SIP topologies (ring / torus / mesh) swept 16 B → 96 KB per PE",
            "Dashed red curve: hand-derived theoretical model for torus_2d (6 SIPs)",
@@ -73,7 +73,7 @@ SLIDES = [
    },
    {
        "title": "5. IPCQ Slot Memory: TCM vs SRAM vs HBM",
-        "image": DIAG / "allreduce_latency_plots" / "buffer_kind_sweep.png",
+        "image": DIAG / "allreduce_latency_plots" / "AllReduce_LRAB_2Dtorus_6SiP_2x3_with_TCM_SRAM_HBM.png",
        "bullets": [
            "Same allreduce with slot memory swapped: TCM (per-PE local) / SRAM / HBM (cube-shared, behind router link)",
            "Cost = NoC drain + slot-IO + PE↔bank hop; only TCM skips the bank hop",
@@ -1,192 +0,0 @@
-"""One-shot: render overview.png with an external 366 µs reference, in two
-variants — log scale and broken y-axis. Reads docs/diagrams/allreduce_latency_plots/summary.csv
-and writes overview_log.png and overview_broken.png alongside it.
-
-This is a derived-artifact generator (per CLAUDE.md): plotting only, no production
-or test logic touched.
-"""
-from __future__ import annotations
-
-import csv
-from pathlib import Path
-
-import matplotlib.pyplot as plt
-import matplotlib.ticker as mticker
-
-ROOT = Path(__file__).resolve().parent.parent
-PLOT_DIR = ROOT / "docs" / "diagrams" / "allreduce_latency_plots"
-CSV_PATH = PLOT_DIR / "summary.csv"
-
-EXT_LABEL = "ext-sim single-device reduce: 366 µs"
-EXT_LATENCY_NS = 366_000.0
-
-COLORS = {
-    "ring_1d": "tab:blue",
-    "torus_2d": "tab:orange",
-    "mesh_2d_no_wrap": "tab:green",
-}
-
-# Hand-derived theoretical model for torus_2d (6 SIPs). Mirrors
-# _aggregate_sweep_plots in tests/test_allreduce_multidevice.py.
-NOC_PACKET_BYTES = 128
-PES_PER_CUBE = 8
-T_STARTUP_NS = 1346.0
-TAU_NS = (8741.0 - 1346.0) / (6144 - 1)
-
-
-def _theoretical_torus_2d_ns(bytes_per_pe: int) -> float:
-    bytes_per_cube = int(bytes_per_pe) * PES_PER_CUBE
-    n_packets = max(1, -(-bytes_per_cube // NOC_PACKET_BYTES))
-    return T_STARTUP_NS + (n_packets - 1) * TAU_NS
-
-
-def _plot_theoretical(ax, records):
-    torus_rs = sorted(
-        [r for r in records if r["sip_topology"] == "torus_2d"],
-        key=lambda r: r["bytes_per_pe"],
-    )
-    if not torus_rs:
-        return
-    ax.plot(
-        [r["bytes_per_pe"] for r in torus_rs],
-        [_theoretical_torus_2d_ns(r["bytes_per_pe"]) for r in torus_rs],
-        color="tab:red", linestyle="--", linewidth=1.6, marker="x",
-        label="theoretical torus_2d (6 SIPs)",
-    )
-
-
-def _bytes_fmt(x, _pos):
-    if x >= 1024 * 1024:
-        return f"{x / (1024 * 1024):.0f}M"
-    if x >= 1024:
-        return f"{x / 1024:.0f}K"
-    return f"{int(x)}"
-
-
-def _load_records():
-    rows = []
-    with open(CSV_PATH, newline="") as f:
-        r = csv.DictReader(f)
-        for row in r:
-            rows.append({
-                "sip_topology": row["sip_topology"],
-                "bytes_per_pe": int(row["bytes_per_pe"]),
-                "latency_ns": float(row["latency_ns"]),
-            })
-    return rows
-
-
-def _ext_x(records):
-    """Anchor the external reference at the largest payload (96 KB / PE)."""
-    return max(r["bytes_per_pe"] for r in records)
-
-
-def _plot_curves(ax, records, topologies):
-    for topo in topologies:
-        rs = sorted([r for r in records if r["sip_topology"] == topo],
-                    key=lambda r: r["bytes_per_pe"])
-        if not rs:
-            continue
-        ax.plot(
-            [r["bytes_per_pe"] for r in rs],
-            [r["latency_ns"] for r in rs],
-            marker="o",
-            label=f"{topo}",
-            color=COLORS.get(topo),
-        )
-
-
-def emit_log(records):
-    topologies = sorted({r["sip_topology"] for r in records})
-    fig, ax = plt.subplots(figsize=(9, 6))
-    _plot_curves(ax, records, topologies)
-    _plot_theoretical(ax, records)
-    ax.scatter(
-        [_ext_x(records)], [EXT_LATENCY_NS],
-        marker="*", s=220, color="tab:red", zorder=5,
-        label=EXT_LABEL,
-    )
-    ax.set_xscale("log", base=2)
-    ax.set_yscale("log")
-    ax.set_xlabel("Bytes per PE (log scale)")
-    ax.set_ylabel("Time (ns) — log scale")
-    ax.set_title("Multi-device allreduce latency vs external single-device reference")
-    ax.grid(True, which="both", alpha=0.3)
-    ax.xaxis.set_major_formatter(mticker.FuncFormatter(_bytes_fmt))
-    ax.legend(loc="upper left")
-    fig.tight_layout()
-    out = PLOT_DIR / "overview_log.png"
-    fig.savefig(out, dpi=120)
-    plt.close(fig)
-    print(f"wrote {out}")
-
-
-def emit_broken(records):
-    topologies = sorted({r["sip_topology"] for r in records})
-    max_local = max(r["latency_ns"] for r in records)
-
-    fig, (ax_top, ax_bot) = plt.subplots(
-        2, 1, sharex=True,
-        gridspec_kw={"height_ratios": [1, 4], "hspace": 0.05},
-        figsize=(9, 6.5),
-    )
-
-    # Bottom panel: today's three curves + theoretical, linear y.
-    _plot_curves(ax_bot, records, topologies)
-    _plot_theoretical(ax_bot, records)
-    ax_bot.set_ylim(0, max_local * 1.10)
-
-    # Top panel: only the external reference marker, linear y around 366 µs.
-    ax_top.scatter(
-        [_ext_x(records)], [EXT_LATENCY_NS],
-        marker="*", s=240, color="tab:red", zorder=5,
-        label=EXT_LABEL,
-    )
-    ax_top.set_ylim(EXT_LATENCY_NS * 0.93, EXT_LATENCY_NS * 1.05)
-
-    # Hide the spine between the two panels and draw diagonal "break" ticks.
-    ax_top.spines["bottom"].set_visible(False)
-    ax_bot.spines["top"].set_visible(False)
-    ax_top.tick_params(labeltop=False, bottom=False)
-    ax_bot.xaxis.tick_bottom()
-
-    d = 0.012  # diagonal-tick size, in axis-fraction
-    kw = dict(transform=ax_top.transAxes, color="k", clip_on=False, lw=1)
-    ax_top.plot((-d, +d), (-d, +d), **kw)
-    ax_top.plot((1 - d, 1 + d), (-d, +d), **kw)
-    kw.update(transform=ax_bot.transAxes)
-    ax_bot.plot((-d, +d), (1 - d * 4, 1 + d * 4), **kw)
-    ax_bot.plot((1 - d, 1 + d), (1 - d * 4, 1 + d * 4), **kw)
-
-    ax_bot.set_xscale("log", base=2)
-    ax_bot.set_xlabel("Bytes per PE (log scale)")
-    ax_bot.set_ylabel("Time (ns)")
-    ax_top.set_ylabel("Time (ns)")
-    ax_bot.grid(True, alpha=0.3)
-    ax_top.grid(True, alpha=0.3)
-    ax_bot.xaxis.set_major_formatter(mticker.FuncFormatter(_bytes_fmt))
-
-    # One legend covering both axes.
-    handles_bot, labels_bot = ax_bot.get_legend_handles_labels()
-    handles_top, labels_top = ax_top.get_legend_handles_labels()
-    ax_bot.legend(handles_bot + handles_top, labels_bot + labels_top,
-                  loc="upper left")
-
-    fig.suptitle("Multi-device allreduce latency vs external single-device reference (broken y-axis)")
-    fig.tight_layout()
-    out = PLOT_DIR / "overview_broken.png"
-    fig.savefig(out, dpi=120)
-    plt.close(fig)
-    print(f"wrote {out}")
-
-
-def main():
-    records = _load_records()
-    if not records:
-        raise SystemExit(f"no rows in {CSV_PATH}")
-    emit_log(records)
-    emit_broken(records)
-
-
-if __name__ == "__main__":
-    main()
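Note on the removed theoretical model: the deleted script computed the torus_2d reference curve as a startup cost plus a per-packet increment, `T = T_STARTUP_NS + (n_packets - 1) * TAU_NS`, where `n_packets` is the per-cube byte count divided by the NoC packet size, rounded up. The sketch below restates that formula with the constants copied from the removed lines; the public function name and the two sample evaluations are added here for illustration.

```python
# Constants copied from the removed script.
NOC_PACKET_BYTES = 128
PES_PER_CUBE = 8
T_STARTUP_NS = 1346.0
TAU_NS = (8741.0 - 1346.0) / (6144 - 1)  # per-packet increment


def theoretical_torus_2d_ns(bytes_per_pe: int) -> float:
    """Startup latency plus one TAU_NS per packet after the first."""
    bytes_per_cube = bytes_per_pe * PES_PER_CUBE
    # Ceiling division, with at least one packet.
    n_packets = max(1, -(-bytes_per_cube // NOC_PACKET_BYTES))
    return T_STARTUP_NS + (n_packets - 1) * TAU_NS


# 16 B per PE -> 128 B per cube -> 1 packet, so only the startup term remains.
small = theoretical_torus_2d_ns(16)
# 96 KB per PE -> 786432 B per cube -> 6144 packets.
large = theoretical_torus_2d_ns(98304)
```

Because `TAU_NS` is defined from the two anchor points (1 packet and 6144 packets), the model returns about 8741 ns at 96 KB per PE by construction.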
@@ -1,237 +1,20 @@
 """Sweep GEMM shapes through kernbench and dump PE_accelerator engine times.

-For each shape:
-  - run benches.matmul_composite via the same run_bench path the CLI uses
-  - read result.engine.op_log
-  - filter to per-PE engines: pe_dma, pe_fetch_store, pe_gemm, pe_math
-  - record sum-of-durations (engine occupancy) AND wall-clock active interval
+Thin wrapper: the sweep logic now lives in
+``kernbench.benches.milestone_1h_gemm`` (the single home, ADR-0054, also the
+``milestone-1h-gemm`` bench). This script remains the manual entry point for
+regenerating ``docs/diagrams/gemm_sweep.json`` on demand and honors the same
+``SWEEP_SHAPES`` / ``SWEEP_TOPOLOGY`` env overrides.

-Output: docs/diagrams/gemm_sweep.json
+    python scripts/gemm_sweep.py
 """
 from __future__ import annotations

-import json
-import os
-import sys
-import time
-from pathlib import Path
-
-# Default sweep covering under-tile, single-tile, multi-tile, and asymmetric regimes.
-# Each entry is either a single integer (square M=K=N=S) or "MxKxN".
-# Override via env: SWEEP_SHAPES="16,32,16x2048x16,..."
-DEFAULT_SHAPES = [
-    "32x32x32",       # 1 tile, K=32 < TILE_K=64 → under-tile in K
-    "32x64x32",       # 1 tile, exact single-tile fit
-    "32x128x32",      # 2 tiles, aligned
-    "32x128x128",     # 8 tiles, aligned
-    "32x3072x32",     # 48 tiles, all K-axis (tall-skinny)
-    "8x128x128",      # 8 tiles, but M=8 < TILE_M=32 → MAC array under-fed
-    "128x8x128",      # 16 tiles, but K=8 < TILE_K=64 → MAC array under-fed
-    "512",            # 2048 tiles, fully aligned — "well-pipelined" reference
-]
-
-# Operand-staging variants exercised per shape.
-VARIANTS = ["ref_ref", "load_ref", "load_load"]
-
-# Engines whose timings we collect (component_id suffix match).
-ENGINES = ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_math"]
-
-# Per-stage breakdown labels (StageType enum names from pe_types.py).
-STAGES = ["DMA_READ", "DMA_WRITE", "FETCH", "STORE", "GEMM", "MATH"]
-
-# Scheduler tile sizes (mirror of PeSchedulerComponent.TILE_M/K/N).
-TILE_M, TILE_K, TILE_N = 32, 64, 32
-
-OUT_PATH = Path(__file__).parent.parent / "docs" / "diagrams" / "gemm_sweep.json"
-
-
-def _engine_wall_ns(records, suffix: str) -> float:
-    """Wall-clock interval the engine was active (union of overlapping ops)."""
-    intervals = [(r.t_start, r.t_end) for r in records
-                 if r.component_id.endswith("." + suffix)]
-    if not intervals:
-        return 0.0
-    intervals.sort()
-    merged_end = intervals[0][1]
-    merged_start = intervals[0][0]
-    total = 0.0
-    for s, e in intervals[1:]:
-        if s <= merged_end:
-            merged_end = max(merged_end, e)
-        else:
-            total += merged_end - merged_start
-            merged_start, merged_end = s, e
-    total += merged_end - merged_start
-    return total
-
-
-def _engine_occupancy_ns(records, suffix: str) -> float:
-    return sum(r.t_end - r.t_start for r in records
-               if r.component_id.endswith("." + suffix))
-
-
-def _engine_count(records, suffix: str) -> int:
-    return sum(1 for r in records if r.component_id.endswith("." + suffix))
-
-
-def _stage_occupancy_ns(records, stage_type: str) -> float:
-    """Sum t_end - t_start over op_log records whose params.stage_type matches.
-
-    Requires op_log records produced post the TileToken stage_type capture
-    (sim_engine/op_log.py).
-    """
-    return sum(
-        r.t_end - r.t_start
-        for r in records
-        if r.params.get("stage_type") == stage_type
-    )
-
-
-def _stage_wall_ns(records, stage_type: str) -> float:
-    """Interval-union wall-clock for records whose stage_type matches."""
-    intervals = sorted(
-        (r.t_start, r.t_end) for r in records
-        if r.params.get("stage_type") == stage_type
-    )
-    if not intervals:
-        return 0.0
-    total = 0.0
-    cs, ce = intervals[0]
-    for s, e in intervals[1:]:
-        if s <= ce:
-            ce = max(ce, e)
-        else:
-            total += ce - cs
-            cs, ce = s, e
-    total += ce - cs
-    return total
-
-
-def _stage_count(records, stage_type: str) -> int:
-    return sum(1 for r in records if r.params.get("stage_type") == stage_type)
-
-
-def _run_one(M: int, K: int, N: int, topology: str, variant: str = "ref_ref") -> dict:
-    os.environ["MATMUL_M"] = str(M)
-    os.environ["MATMUL_K"] = str(K)
-    os.environ["MATMUL_N"] = str(N)
-    os.environ["MATMUL_VARIANT"] = variant
-
-    # Late imports so env vars are read by benches/matmul_composite at module load.
-    # Force re-import to pick up new env values.
-    for mod_name in [m for m in list(sys.modules) if m.startswith("benches.matmul_composite")]:
-        del sys.modules[mod_name]
-
-    from benches.loader import resolve_bench
-    from kernbench.runtime_api.bench_runner import run_bench
-    from kernbench.runtime_api.types import resolve_device
-    from kernbench.sim_engine.engine import GraphEngine
-    from kernbench.topology.builder import resolve_topology
-
-    topo = resolve_topology(topology)
-    bench = resolve_bench("matmul_composite")
-    device = resolve_device(None)
-
-    t0 = time.time()
-    result = run_bench(
-        topology=topo, bench_fn=bench, device=device,
-        engine_factory=lambda t, d: GraphEngine(
-            getattr(t, "topology_obj", t), enable_data=True,
-        ),
-    )
-    wall = time.time() - t0
-
-    op_log = result.engine.op_log
-    if not result.completion.ok:
-        raise RuntimeError(f"bench failed at M={M},K={K},N={N}: {result.completion}")
-
-    # Bytes touched at f16 (2 B): full A + full B + full out (each operand
-    # streamed once through HBM by the composite plan).
-    bytes_total = (M * K + K * N + M * N) * 2
-    row = {
-        "M": M, "K": K, "N": N,
-        "variant": variant,
-        "flops": 2 * M * K * N,
-        "bytes_hbm": bytes_total,
-        "arith_intensity": (2 * M * K * N) / bytes_total,  # flops/byte
-        "tile_count_expected": _ceil(M, TILE_M) * _ceil(N, TILE_N) * _ceil(K, TILE_K),
-        "sim_wall_clock_s": round(wall, 3),
-        "engines": {},
-    }
-    for eng in ENGINES:
-        row["engines"][eng] = {
-            "occupancy_ns": _engine_occupancy_ns(op_log, eng),
-            "wall_ns":      _engine_wall_ns(op_log, eng),
-            "record_count": _engine_count(op_log, eng),
-        }
-    row["stages"] = {}
-    for stage in STAGES:
-        row["stages"][stage] = {
-            "occupancy_ns": _stage_occupancy_ns(op_log, stage),
-            "wall_ns":      _stage_wall_ns(op_log, stage),
-            "record_count": _stage_count(op_log, stage),
-        }
-    # Kernel-window wall-clock = max t_end - min t_start over PE engine records.
-    pe_records = [r for r in op_log
-                  if any(r.component_id.endswith("." + e) for e in ENGINES)]
-    if pe_records:
-        row["pe_window_ns"] = max(r.t_end for r in pe_records) \
-                              - min(r.t_start for r in pe_records)
-    else:
-        row["pe_window_ns"] = 0.0
-    stage_records = [r for r in op_log
-                     if r.params.get("stage_type") in STAGES]
-    if stage_records:
-        row["composite_window_ns"] = max(r.t_end for r in stage_records) \
-                                     - min(r.t_start for r in stage_records)
-    else:
-        row["composite_window_ns"] = 0.0
-    return row
-
-
-def _ceil(a: int, b: int) -> int:
-    return (a + b - 1) // b
+from kernbench.benches.milestone_1h_gemm import run_sweep


 def main() -> int:
-    shapes_env = os.environ.get("SWEEP_SHAPES")
-    raw = (shapes_env.split(",") if shapes_env else DEFAULT_SHAPES)
-    shapes: list[tuple[int, int, int]] = []
-    for s in raw:
-        s = s.strip()
-        if not s:
-            continue
-        if "x" in s.lower():
-            parts = s.lower().split("x")
-            shapes.append((int(parts[0]), int(parts[1]), int(parts[2])))
-        else:
-            v = int(s)
-            shapes.append((v, v, v))
-    topology = os.environ.get("SWEEP_TOPOLOGY", "topology.yaml")
-
-    rows = []
-    for M, K, N in shapes:
-        for variant in VARIANTS:
-            print(f"[sweep] M={M} K={K} N={N} variant={variant} ...", flush=True)
-            row = _run_one(M, K, N, topology, variant=variant)
-            rows.append(row)
-            eng_dma = row["engines"]["pe_dma"]
-            eng_gem = row["engines"]["pe_gemm"]
-            print(f"   tiles={row['tile_count_expected']:>6}  "
-                  f"pe_window={row['pe_window_ns']:8.1f}ns  "
-                  f"dma_occ={eng_dma['occupancy_ns']:9.1f}  "
-                  f"gemm_occ={eng_gem['occupancy_ns']:8.1f}  "
-                  f"(sim {row['sim_wall_clock_s']:.1f}s)")
-
-    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
-    OUT_PATH.write_text(json.dumps({
-        "tile_sizes": {"M": TILE_M, "K": TILE_K, "N": TILE_N},
-        "engines": ENGINES,
-        "stages": STAGES,
-        "variants": VARIANTS,
-        "rows": rows,
-    }, indent=2))
-    print(f"\n[sweep] wrote {OUT_PATH}")
+    run_sweep()
    return 0
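Note on the shape syntax: the removed lines accepted each `SWEEP_SHAPES` entry as either a single integer (square `M=K=N`) or `MxKxN`, and derived the expected tile count from the scheduler tile sizes. The sketch below restates that logic as standalone helpers. The helper names are hypothetical, and whether `kernbench.benches.milestone_1h_gemm` keeps exactly this parsing is an assumption, since this diff does not show that module.

```python
from typing import Tuple

# Scheduler tile sizes as listed in the removed script.
TILE_M, TILE_K, TILE_N = 32, 64, 32


def parse_shape(spec: str) -> Tuple[int, int, int]:
    """Parse "MxKxN" or a single integer S (meaning M=K=N=S)."""
    spec = spec.strip().lower()
    if "x" in spec:
        # Stricter than the removed code: exactly three parts are required.
        m, k, n = (int(part) for part in spec.split("x"))
        return m, k, n
    side = int(spec)
    return side, side, side


def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def expected_tiles(m: int, k: int, n: int) -> int:
    """Tiles along M times tiles along N times tiles along K."""
    return ceil_div(m, TILE_M) * ceil_div(n, TILE_N) * ceil_div(k, TILE_K)


tall_skinny = expected_tiles(*parse_shape("32x3072x32"))
square = expected_tiles(*parse_shape("512"))
```

These two values agree with the tile counts given in the removed `DEFAULT_SHAPES` comments (48 tiles and 2048 tiles respectively).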


@@ -0,0 +1,13 @@
+buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
+hbm,torus_2d,6,128,256,2120.040000000012
+hbm,torus_2d,6,1024,2048,2717.2783333333473
+hbm,torus_2d,6,8192,16384,7315.184999999989
+hbm,torus_2d,6,32768,65536,23081.26500000037
+sram,torus_2d,6,128,256,2060.040000000012
+sram,torus_2d,6,1024,2048,2909.2783333333473
+sram,torus_2d,6,8192,16384,9523.184999999869
+sram,torus_2d,6,32768,65536,32201.265000000385
+tcm,torus_2d,6,128,256,1964.040000000012
+tcm,torus_2d,6,1024,2048,2477.2783333333473
+tcm,torus_2d,6,8192,16384,6403.185000000109
+tcm,torus_2d,6,32768,65536,19865.265000000378
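Note on reading this baseline: one way to compare the three slot-memory kinds is to normalize each latency by the TCM latency at the same payload size. The sketch below does that for the smallest and largest payloads, with rows copied verbatim from the CSV above. The variable names and the choice of TCM as the reference are illustrative assumptions, not part of the bench.

```python
import csv
import io

# Rows copied from the CSV above (256 B and 64 KB per PE only).
CSV_TEXT = """buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
hbm,torus_2d,6,128,256,2120.040000000012
hbm,torus_2d,6,32768,65536,23081.26500000037
sram,torus_2d,6,128,256,2060.040000000012
sram,torus_2d,6,32768,65536,32201.265000000385
tcm,torus_2d,6,128,256,1964.040000000012
tcm,torus_2d,6,32768,65536,19865.265000000378
"""

# (buffer_kind, bytes_per_pe) -> latency in ns
latency = {}
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    latency[(row["buffer_kind"], int(row["bytes_per_pe"]))] = float(row["latency_ns"])

# Latency relative to TCM at the same payload size.
ratios = {
    (kind, size): value / latency[("tcm", size)]
    for (kind, size), value in latency.items()
}

for (kind, size), ratio in sorted(ratios.items()):
    print(f"{kind:>4} {size:>6} B/PE  {ratio:.3f}x TCM")
```

On these rows TCM is the fastest kind at both sizes, and at 64 KB per PE the SRAM row is slower than the HBM row.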
@@ -0,0 +1,37 @@
+algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,8,16,256,2666.552500000015
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,32,64,1024,2747.7400000000152
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,64,128,2048,2855.990000000018
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,128,256,4096,3072.490000000019
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3337.1133333333582
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3708.0333333333692
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4449.873333333393
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,5933.020000000124
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,8900.379999999863
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,14835.099999999224
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,26704.540000000765
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,38573.97999999701
+lrab_hierarchical_allreduce,ring_1d,6,8,16,256,2365.255833333347
+lrab_hierarchical_allreduce,ring_1d,6,32,64,1024,2436.9433333333473
+lrab_hierarchical_allreduce,ring_1d,6,64,128,2048,2532.526666666683
+lrab_hierarchical_allreduce,ring_1d,6,128,256,4096,2723.693333333349
+lrab_hierarchical_allreduce,ring_1d,6,512,1024,16384,3048.635000000021
+lrab_hierarchical_allreduce,ring_1d,6,1024,2048,32768,3393.4016666666957
+lrab_hierarchical_allreduce,ring_1d,6,2048,4096,65536,4082.401666666714
+lrab_hierarchical_allreduce,ring_1d,6,4096,8192,131072,5458.80166666677
+lrab_hierarchical_allreduce,ring_1d,6,8192,16384,262144,8216.934999999943
+lrab_hierarchical_allreduce,ring_1d,6,16384,32768,524288,13733.201666665835
+lrab_hierarchical_allreduce,ring_1d,6,32768,65536,1048576,24765.73500000064
+lrab_hierarchical_allreduce,ring_1d,6,49152,98304,1572864,35798.268333331536
+lrab_hierarchical_allreduce,torus_2d,6,8,16,256,1700.6025000000095
+lrab_hierarchical_allreduce,torus_2d,6,32,64,1024,1753.2900000000102
+lrab_hierarchical_allreduce,torus_2d,6,64,128,2048,1823.540000000012
+lrab_hierarchical_allreduce,torus_2d,6,128,256,4096,1964.040000000012
+lrab_hierarchical_allreduce,torus_2d,6,512,1024,16384,2196.8183333333463
+lrab_hierarchical_allreduce,torus_2d,6,1024,2048,32768,2477.2783333333473
+lrab_hierarchical_allreduce,torus_2d,6,2048,4096,65536,3038.1983333333583
+lrab_hierarchical_allreduce,torus_2d,6,4096,8192,131072,4159.5050000000665
+lrab_hierarchical_allreduce,torus_2d,6,8192,16384,262144,6403.185000000109
+lrab_hierarchical_allreduce,torus_2d,6,16384,32768,524288,10890.5449999995
+lrab_hierarchical_allreduce,torus_2d,6,32768,65536,1048576,19865.265000000378
+lrab_hierarchical_allreduce,torus_2d,6,49152,98304,1572864,28839.98500000059
@@ -0,0 +1,65 @@
+{
+  "version": 1,
+  "validation_scale": true,
+  "panels": [
+    "single_user_prefill",
+    "multi_user_prefill",
+    "single_user_decode",
+    "multi_user_decode"
+  ],
+  "config": {
+    "S_q_prefill": 16,
+    "S_kv_per_rank": 16,
+    "h_q": 1,
+    "h_kv": 1,
+    "d_head": 64,
+    "n_ranks_single_user": 8,
+    "n_ranks_multi_user": 4
+  },
+  "rows": [
+    {
+      "panel": "single_user_prefill",
+      "n_ranks": 8,
+      "op_log_summary": {
+        "gemm_count": 128,
+        "ipcq_send_count": 112,
+        "ipcq_recv_count": 112,
+        "dma_read_count": 24,
+        "dma_write_count": 8
+      }
+    },
+    {
+      "panel": "multi_user_prefill",
+      "n_ranks": 4,
+      "op_log_summary": {
+        "gemm_count": 32,
+        "ipcq_send_count": 24,
+        "ipcq_recv_count": 24,
+        "dma_read_count": 12,
+        "dma_write_count": 4
+      }
+    },
+    {
+      "panel": "single_user_decode",
+      "n_ranks": 8,
+      "op_log_summary": {
+        "gemm_count": 16,
+        "ipcq_send_count": 168,
+        "ipcq_recv_count": 168,
+        "dma_read_count": 24,
+        "dma_write_count": 8
+      }
+    },
+    {
+      "panel": "multi_user_decode",
+      "n_ranks": 4,
+      "op_log_summary": {
+        "gemm_count": 8,
+        "ipcq_send_count": 36,
+        "ipcq_recv_count": 36,
+        "dma_read_count": 12,
+        "dma_write_count": 4
+      }
+    }
+  ]
+}
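Note on sanity-checking this baseline: a quick consistency pass over the `rows` array is to confirm that IPCQ sends and receives balance in every panel and to normalize the counts per rank. The sketch below embeds a trimmed copy of the JSON above (only the fields it uses). The check and the per-rank normalization are illustrative assumptions, not something the bench is stated to enforce.

```python
import json

# Trimmed copy of the rows above; counts are copied verbatim.
SWEEP_JSON = """
{"rows": [
  {"panel": "single_user_prefill", "n_ranks": 8,
   "op_log_summary": {"gemm_count": 128, "ipcq_send_count": 112, "ipcq_recv_count": 112,
                      "dma_read_count": 24, "dma_write_count": 8}},
  {"panel": "multi_user_prefill", "n_ranks": 4,
   "op_log_summary": {"gemm_count": 32, "ipcq_send_count": 24, "ipcq_recv_count": 24,
                      "dma_read_count": 12, "dma_write_count": 4}},
  {"panel": "single_user_decode", "n_ranks": 8,
   "op_log_summary": {"gemm_count": 16, "ipcq_send_count": 168, "ipcq_recv_count": 168,
                      "dma_read_count": 24, "dma_write_count": 8}},
  {"panel": "multi_user_decode", "n_ranks": 4,
   "op_log_summary": {"gemm_count": 8, "ipcq_send_count": 36, "ipcq_recv_count": 36,
                      "dma_read_count": 12, "dma_write_count": 4}}
]}
"""

rows = json.loads(SWEEP_JSON)["rows"]

# Every send should be matched by a receive within the same panel.
balanced = {
    row["panel"]: row["op_log_summary"]["ipcq_send_count"]
    == row["op_log_summary"]["ipcq_recv_count"]
    for row in rows
}

# Per-rank counts make the 8-rank and 4-rank panels comparable.
per_rank = {
    row["panel"]: {
        name: count / row["n_ranks"]
        for name, count in row["op_log_summary"].items()
    }
    for row in rows
}
```

In this committed baseline every panel is balanced, and each rank issues three DMA reads and one DMA write regardless of panel.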
@@ -0,0 +1,9 @@
+"""kernbench.benches: eager-import sibling modules so @bench fires.
+
+Underscore-prefixed modules are treated as helpers and skipped.
+After import, every imported module must have registered at least one
+bench, or a RuntimeError is raised by the audit.
+"""
+from kernbench.benches.registry import _eager_import_and_audit
+
+_eager_import_and_audit(__path__, __name__)