attention: add rank_axis kwarg to mesh kernels for multi_user cube ring

ADR-0059 single_user_* panels run the ring across PEs in one cube (rank == tl.program_id(axis=0)). multi_user_* panels run the ring across cubes: rank should be cube_id (axis=1), and 7 of every 8 PEs in each cube must stay silent because the cube-level SFR install only gives the cube-coordinate PE 0 an E/W neighbor.

Add ``rank_axis: int = 0`` kwarg to both ``attention_mesh_mlo_kernel`` and ``attention_mesh_kv_kernel``:
- 0 (default): rank == tl.program_id(axis=0). Existing single_user behavior, all spec tests unchanged.
- 1: gate ``if tl.program_id(axis=0) != 0: return`` at kernel start, then ``rank = tl.program_id(axis=1)``. multi_user_* panels pass this to the kernel via ctx.launch positional arg.

Also brings in _attention_mesh_kv.py and _attention_mesh_mlo.py as the committed home of the ADR-0059 kernels (previously living uncommitted in the working tree from sub-cycle 4b).

Tests: 7-test rank_axis spec file (default-path + rank_axis=1 gating and cube-id semantics, both kernels); 4-panel diag harness now green end-to-end (single_user_prefill/decode + multi_user_prefill/decode); 763-test wider sweep clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
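
The rank selection described in this commit message can be sketched as follows. This is a minimal illustration, not the kernels' actual code: `select_rank` is a hypothetical helper, and `program_id` is an assumed stand-in callable for `tl.program_id`.

```python
from typing import Callable, Optional


def select_rank(program_id: Callable[[int], int], rank_axis: int = 0) -> Optional[int]:
    """Return the ring rank for this PE, or None if the PE must stay silent."""
    if rank_axis == 0:
        # single_user_*: the ring runs across PEs inside one cube.
        return program_id(0)
    if rank_axis == 1:
        # multi_user_*: only PE 0 of each cube takes part in the cube ring.
        if program_id(0) != 0:
            return None
        return program_id(1)
    raise ValueError(f"unsupported rank_axis: {rank_axis}")


def make_program_id(pe_id: int, cube_id: int) -> Callable[[int], int]:
    """Build a fake program_id(axis) for a PE at hypothetical coordinates."""
    coords = {0: pe_id, 1: cube_id}
    return lambda axis: coords[axis]


# PE 3 of cube 5: ranked by PE index on axis 0, silent on axis 1.
print(select_rank(make_program_id(3, 5), rank_axis=0))
print(select_rank(make_program_id(3, 5), rank_axis=1))
# PE 0 of cube 5: carries the cube's rank on axis 1.
print(select_rank(make_program_id(0, 5), rank_axis=1))
```

Returning `None` here plays the role of the early `return` in the real kernels; a caller would simply skip all ring traffic for that PE.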
runtime_api: ctx.launch honors DPPolicy.num_cubes + adds _auto_dim_remap opt-out
2026-06-01 19:53:18 -07:00 · 2026-06-01 19:33:40 -07:00 · 2026-06-01 19:14:09 -07:00 · 2026-05-22 15:37:27 -07:00 · 2026-05-22 15:19:52 -07:00 · 2026-05-22 11:15:37 -07:00
56 changed files with 11807 additions and 1484 deletions
@@ -371,6 +371,13 @@ Concrete forms that Part 1's *Verification Plan* MUST take in this repo:
 - `kernbench run --device <id>` runs the benchmark on a single device.
 - Omitting `--device` runs the benchmark on all devices discovered in the topology (logically parallel).
 - Device enumeration is handled by the CLI only; benchmarks MUST remain single-device.
+- **Eval-bench exception (ADR-0054)**: a *milestone / eval bench*
+  (`milestone-1h-*`) may drive many configurations and build its own
+  per-config engines to regenerate a domain's full result + figure set; it
+  ignores `--device` and submits a sentinel tensor to satisfy the
+  "must submit ≥1 request" contract (ADR-0045 D4). This is the eval-harness
+  carve-out to the single-device rule, alongside the ADR-0024 multi-SIP CCL
+  exception.

 ## Derived Artifacts (Clarification)

@@ -7,6 +7,11 @@ Accepted
 Documents the `tests/sccl/` evaluation harness; verified against the implementation
 (constants, file set, and sweep dimensions cross-checked).

+**Amended by ADR-0054**: the driver core, sweep, and renderer moved to the
+`milestone-1h-ccl` bench (single home); `tests/sccl/_allreduce_helpers.py` now
+re-exports from there (the pytest-only param builders and the
+`_run_distributed` wrapper stay local). The figure tests are unchanged.
+
 ## Context

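 ADR-0032 defines the intercube all-reduce *algorithm*, and ADR-0023/0024/0027
@@ -8,6 +8,12 @@ Documents the GEMM evaluation/characterization harness; verified against the implementation
 (constants, tile sizes, figure set, and script↔test split cross-checked). The D5/D6
 caveats are recorded limitations, not inaccuracies.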

+**Amended by ADR-0054**: the sweep and renderer moved to the
+`milestone-1h-gemm` bench (single home); `scripts/gemm_sweep.py` and
+`tests/gemm/` now re-export from there. The D1/D2 rule "data generation is a
+manual script / heavy work is opt-in" is replaced by the eval-bench pattern
+(one bench regenerates everything; `MILESTONE_FAST=1` reuses the committed JSON).
+
 ## Context

 For the GEMM *implementation*, ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generator)
@@ -0,0 +1,265 @@
+# ADR-0045: Bench Module Contract — registration, dispatch, and authoring
+
+## Status
+
+Accepted (2026-05-21).
+
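+Defines in one place the registration mechanism (@bench) of the
+`src/kernbench/benches/` package, the CLI dispatch path (`kernbench run/list`),
+and the contract to follow when writing a new bench module. ADR-0010 (CLI
+surface) specifies the `kernbench list/run` interface, but **how a bench is
+registered and which function signature it must follow** was missing at the
+ADR level.
+
+**Extended by ADR-0054**: a third pattern is added to the single-configuration
+rule of D5. An *eval bench* (e.g. `milestone-1h-*`) drives several
+configurations, builds its own engine per configuration, and submits a
+sentinel tensor to satisfy D4.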
+
+## First action (the very first thing that happens)
+
+When the `kernbench.benches` package is imported, `__init__.py` immediately
+calls `_eager_import_and_audit(__path__, __name__)`. The first thing this
+function does is enumerate every sibling module in the package directory with
+`pkgutil.iter_modules(__path__)` and then **eagerly load**, via
+`importlib.import_module(...)`, every module that matches neither of the
+following two skip conditions:
+
+- the name is `registry` (the infrastructure itself)
+- the name starts with `_` (helper modules)
+
+At import time, the `@bench(name=..., description=...)` decorators inside each
+module run: a `(name, description, fn)` tuple is appended to the `_PENDING`
+list, and `fn.__module__` is added to the `_REGISTERED_MODULES` set.
+
+Once all imports finish, `_audit_modules(imported, _REGISTERED_MODULES)` is
+called. If any module was **imported but never called @bench**, it raises
+`RuntimeError("Bench module(s) missing @bench decorator: ...")` immediately.
+When the audit passes, indices have not been assigned yet; the first
+`list_all()` / `resolve(...)` call triggers `_finalize()`, which assigns
+1-based indices in alphabetical order of name.
+
+In short, **the first job of the bench infrastructure is to import every
+non-helper module in the package directory and audit that each one called
+@bench at least once**.
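
The skip rule and the audit above can be sketched in a few lines. This is an illustration under stated assumptions, not the registry's real code: `is_bench_candidate` and `audit_modules` are hypothetical names, the module list stands in for `pkgutil.iter_modules(__path__)`, and short module names are used where the real set holds `fn.__module__` values.

```python
def is_bench_candidate(module_name: str) -> bool:
    """True for modules that should be eagerly imported and audited."""
    # Skip the registry infrastructure itself and any `_`-prefixed helper.
    return module_name != "registry" and not module_name.startswith("_")


def audit_modules(imported: set, registered: set) -> None:
    """Raise if any imported module never called @bench."""
    missing = sorted(imported - registered)
    if missing:
        raise RuntimeError(
            "Bench module(s) missing @bench decorator: " + ", ".join(missing)
        )


# Hypothetical package listing, standing in for pkgutil.iter_modules(__path__).
siblings = ["registry", "_shared_helpers", "gemm_single_pe", "qkv_gemm"]
imported = {name for name in siblings if is_bench_candidate(name)}

# Every imported module registered a bench: the audit passes silently.
audit_modules(imported, registered={"gemm_single_pe", "qkv_gemm"})

# One module forgot @bench: the audit names it in the error.
try:
    audit_modules(imported, registered={"gemm_single_pe"})
except RuntimeError as exc:
    print(exc)
```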
+
+## Context
+
+`src/kernbench/benches/` currently holds 8 bench modules (`ccl_allreduce`,
+`gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`, `qkv_gemm`,
+`qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows the same flow:
+
+```
+kernbench run --topology <T> --bench <N>
+   ↓
+cli/main.py::cmd_run
+   ↓  resolve_topology(T)  + resolve(N)  + resolve_device(device_arg)
+   ↓
+runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
+   ↓  engine_factory(topology, device) → GraphEngine
+   ↓  RuntimeContext(engine, target_device, correlation_id, spec)
+   ↓
+bench_fn(ctx)        ← the run(torch) defined by the bench is called
+   ↓  submits through ctx.empty/zeros/from_numpy/launch/distributed.* etc.
+   ↓
+ctx.wait_all()       ← drains any unfinished handles
+   ↓
+BenchResult(completion, correlation_id, trace, traces, engine)
+```
+
+ADR-0010 covers only the CLI surface (`run/list/probe/web`), and ADR-0007
+covers only the responsibility boundary between the runtime API and
+sim_engine. The actual question, "what shape must a new bench file take", can
+only be answered by tracing code conventions. As a result:
+
+- The calling convention of the @bench decorator (kebab-case name, non-empty
+  description) exists only in code.
+- The bench function signature (`def run(torch)`) is effectively a convention,
+  enforced only by the fact that CLI dispatch calls `spec.run`.
+- New bench authors learn that "helper modules must be split out with a `_`
+  prefix" only after hitting the audit RuntimeError.
+- It is unclear how far the single-device convention (CLAUDE.md Part 2 CLI
+  Semantics) applies from a bench author's point of view (is the CCL
+  multi-SIP bench an exception?).
+
+This ADR resolves these ambiguities in one place.
+
+## Decision
+
+### D1. The @bench decorator contract
+
+```python
+from kernbench.benches.registry import bench
+
+@bench(name="my-bench", description="Short, complete-sentence description.")
+def run(torch):
+    ...
+```
+
+- `name`: a kebab-case string. It must match the regex
+  `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`: lowercase letters, digits, and dashes only;
+  no underscores (`_`); must start with a letter.
+- `description`: a non-empty string (length > 0 after strip). It is shown
+  verbatim in the CLI `list` output.
+- The decorator **returns fn unmodified**, so fn can still be called directly.
+  Its only side effect is appending the registration to `_PENDING`.
+
+Violating either of the two rules above raises `ValueError` immediately.
+Duplicate names are caught at `_finalize()` time with
+`RuntimeError("duplicate bench name: ...")`.
+
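+### D2. Module file convention
+
+`src/kernbench/benches/<slug>.py` must be one of the following:
+
+- **bench module**: `@bench(...)` runs at least once on the top-level import
+  path, registering at least one bench.
+- **helper module**: the file name starts with `_` (e.g. `_shared_helpers.py`).
+  It is skipped during the `iter_modules` traversal.
+
+The audit (`_audit_modules`) rejects any non-helper module that never calls
+@bench. The intended result: adding a new file to `benches/` is enough to
+register it automatically, and the distinction from helpers is marked
+unambiguously by the **file-name prefix** alone.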
+
+### D3. The bench function signature is `def run(torch)`
+
+The decorator does not enforce the function name, but **CLI dispatch calls
+`spec_entry.run` (i.e. the decorated callable)**. The convention is therefore:
+
+- Function name: `run`. Decorating a function with another name still works,
+  but always use `run` for readability and grep-ability.
+- Argument: a single positional argument `torch`. It is actually a
+  `RuntimeContext` instance that exposes a PyTorch-style namespace
+  (zeros/empty/launch/distributed/...) (ADR-0024 D3).
+- Return value: arbitrary (`Any`). `run_bench` currently ignores the return
+  value and tracks completion through `ctx.handles()` /
+  `engine.get_completion()`.
+
+The name `torch` mimics the PyTorch-compatible idiom; no actual PyTorch module
+is passed in (aligned with the "rank = SIP" launcher convention of ADR-0024).
+
+### D4. A bench must perform at least one submit
+
+When `ctx.handles()` is empty, `run_bench` returns BenchResult.completion as
+`ok=False, error_code="NO_REQUESTS"`. A meaningful bench must therefore call
+at least one of the following:
+
+- Tensor creation APIs: `torch.zeros(...)`, `torch.empty(...)`. Internally
+  these submit `MmuMapMsg` and (for zeros) `MemoryWriteMsg`.
+- Kernel execution API: `torch.launch(name, fn, *args)`, which submits a
+  `KernelLaunchMsg` per SIP.
+- (Exception) empty placeholder bench: a stub that only calls `print(...)`,
+  like `ipcq_allreduce.py`, gets the NO_REQUESTS result. CI is assumed to
+  recognize it as a placeholder and handle it separately.
+
+### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)
+
+The **"benchmarks MUST remain single-device"** convention stated in CLAUDE.md
+Part 2 CLI Semantics is interpreted as follows:
+
+- **Regular bench (single SIP)**: define tensor placement with
+  `dp = DPPolicy(...)` and launch kernels with `torch.launch(...)`. The SIP
+  index is determined by `--device` (the CLI's responsibility).
+- **CCL bench (multi-SIP)**: as an exception, follow the rank = SIP pattern
+  (ADR-0024 D3) using `torch.distributed.init_process_group(backend="ahbm")` +
+  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)`. `--device` is ignored
+  (or assumed to be `all`), and each spawned worker binds its own SIP with
+  `torch.ahbm.set_device(rank)`.
+
+Any multi-device usage outside these two patterns (e.g. one bench function
+launching directly on several SIPs in the same process) is prohibited by this
+ADR. Even when the CLI is invoked with `--device all`, the bench runs only
+once; to handle multiple SIPs inside it, use the second pattern of D5.
+
+### D6. Name and index resolution (`resolve`)
+
+`resolve(identifier: str)` returns a BenchSpec in the following order:
+
+1. `identifier.isdigit()` → convert to an integer and return the spec among
+   the `_REGISTRY` entries whose `index` matches. If none matches, raise
+   `ValueError("No bench with index ...")`.
+2. `identifier in _REGISTRY` → direct lookup.
+3. Otherwise → `ValueError("Unknown bench ...")`.
+
+An empty or whitespace-only identifier raises
+`ValueError("bench identifier must be a non-empty string.")`.
+
+The CLI passes the `--bench` argument to `resolve` unchanged, so users can use
+either `kernbench run --bench gemm-single-pe` or `kernbench run --bench 2`.
+
+### D7. Indices are not a stable API
+
+`_finalize()` sorts `_PENDING` **alphabetically by name** and then assigns
+1-based indices, so adding a new bench can shift the indices of existing ones.
+Therefore:
+
+- Human-friendly interactive use: indices are fine.
+- Scripts / CI automation: always use names.
+
+This is stated in the `registry.py` module docstring.
+
+### D8. The surface RuntimeContext exposes to benches
+
+The surface a bench function may legitimately use through the `torch`
+parameter:
+
+- **Tensor creation**: `torch.empty(shape, dtype=..., dp=DPPolicy(...), name=...)`,
+  `torch.zeros(...)`, `torch.from_numpy(arr)`. All submit host-side metadata +
+  device deployment (MmuMap + MemoryWrite).
+- **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)` converts
+  `(Tensor, int, float)` positional arguments into `TensorArg` / `ScalarArg`,
+  issues a `KernelLaunchMsg` per SIP, then drains.
+- **Synchronization**: `torch.wait(handle)`, `torch.wait_all()` (called
+  automatically by run_bench).
+- **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`,
+  `torch.distributed.get_world_size()`, `torch.distributed.all_reduce(t, op=...)`
+  (ADR-0024/0027).
+- **Multi-process (rank=SIP)**: `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)`
+  (ADR-0024 D3 / ADR-0027).
+- **Device binding**: `torch.ahbm.set_device(rank)` or
+  `torch.accelerator.set_device_index(rank)` (both point to the same namespace).
+- **IPCQ install**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)` (ADR-0023 D10).
+- **Spec lookup**: `torch.spec`, the dict built by the topology builder
+  (system, cube_mesh, HBM parameters, etc.). It exists so that benches can be
+  written without depending on topology.yaml parameters.
+
+Benches must not directly access private RuntimeContext members that are not
+listed above (`_handles`, `_traces`, `_allocators`, etc.). This follows the
+spirit of the ADR-0007 layer boundary: only the one-way direction
+bench → runtime API → sim_engine is allowed.
+
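+### D9. Parameterization through environment variables is allowed
+
+Injecting bench parameters from outside, as `matmul_composite.py` does with
+`os.environ.get("MATMUL_M", ...)`, is allowed. Reasons:
+
+- The bench function signature is fixed to `def run(torch)` by D3, so
+  receiving parameters as positional/keyword arguments is impractical.
+- The environment-variable pattern is a natural hook for operation-time sweeps
+  such as `MATMUL_VARIANT`.
+- External drivers such as `scripts/gemm_sweep.py` (ADR-0044) use this hook.
+
+However, if environment variables change a bench's behavior, document every
+variable in the module docstring (matmul_composite.py is the example).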
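
A small sketch of the environment-variable hook, for illustration only. `MATMUL_M` and `MATMUL_VARIANT` are the variable names the ADR mentions; the helper `read_int_env`, the default values, and the variant name `"baseline"` are assumptions and do not reflect what `matmul_composite.py` actually does.

```python
import os


def read_int_env(name: str, default: int) -> int:
    """Read an integer bench parameter from the environment, with a fallback."""
    raw = os.environ.get(name)
    if raw is None or not raw.strip():
        return default
    return int(raw)


# An external sweep driver would set the variable before invoking the bench.
os.environ["MATMUL_M"] = "256"

m = read_int_env("MATMUL_M", default=128)
variant = os.environ.get("MATMUL_VARIANT", "baseline")
print(m, variant)
```

Keeping the parsing in one helper makes it easier to list every variable in the module docstring, as D9 asks.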
+
+## Alternatives Considered
+
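+### A1. Keep the bench list in an explicit manifest file (YAML)
+
+Rejected. The @bench decorator + audit pattern guarantees "add a file =
+auto-registered", concentrating the cognitive cost for new bench authors in
+one place (writing the file). A separate manifest carries a high drift risk
+on the maintenance side, and the helper split is already clear from the `_`
+prefix.
+
+### A2. Take the bench function name as a decorator argument (`@bench(name=..., entry="run_xxx")`)
+
+Rejected. It breaks the simplicity of the dispatch side calling only
+`spec.run`. The `run` convention alone is sufficient, and if variants are
+needed, several functions can be registered in the same module (each decorated
+with @bench).
+
+### A3. A separate `@multi_device_bench` decorator for CCL benches
+
+Rejected. The two patterns stated in D5 (single + ADR-0024 multi-SIP) are
+enough to express all 8 current benches. A separate decorator would force
+branching on the dispatch side and add complexity, and the intent to use
+multiple SIPs is already evident from the `init_process_group(...)` call in
+the bench body.
+
+### A4. Make indices a stable API (registration order / explicit index= argument)
+
+Rejected. This is the trade-off stated in D7: user friendliness comes first
+(alphabetically sorted indices appear naturally as 1, 2, 3... in the list
+output). Scripts can simply specify names.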
+
+## Consequences
+
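+- "How to add a bench" is consolidated in one ADR, so new authors only need
+  to follow D1-D3 and D8 without grepping the code.
+- The pattern of splitting helper modules with a `_` prefix is justified at
+  the ADR level, freeing future authors to write shared helpers as
+  `benches/_*.py`.
+- D5 makes explicit that the single-device convention in CLAUDE.md Part 2 CLI
+  Semantics does not contradict the multi-SIP CCL bench: the two are orthogonal.
+- The rationale (D9) for `scripts/gemm_sweep.py` in ADR-0044 (GEMM eval
+  harness) using the environment-variable hook is fixed in this ADR.
+- Index instability (D7) is now explicit, so CI code such as
+  `kernbench run --bench 3` should be reviewed right after this ADR is accepted.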
@@ -0,0 +1,307 @@
+# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
+
+## Status
+
+Accepted (2026-05-22).
+
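+Specifies the set of `tl.*` primitives exposed by `TLContext` in
+`src/kernbench/triton_emu/`, their semantics, and the contract of the two
+execution modes (command-list / greenlet runner). ADR-0014/0020 define the PE
+pipeline and the 2-pass execution model, but **the `tl.*` surface itself, as
+called by a bench's kernel function**, had not been documented at the ADR level.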
+
+## First action (the very first thing that happens)
+
+When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id, num_cubes,
+scratch_base, scratch_size)` is constructed, it first initializes the
+following six groups of fields:
+
+- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes`:
+  the values that `tl.program_id` / `tl.num_programs` return.
+- `self._dispatch_cycles`: the cycle count of the `PeCpuOverheadCmd(cycles)`
+  issued automatically at the start of every `tl.*` API call.
+- `self._runner`: the `KernelRunner` instance (greenlet mode if present,
+  command-list mode if absent).
+- `self._commands: list[PeCommand] = []`: the command sequence accumulated in
+  command-list mode.
+- `self._handle_counter = 0`, `self._completion_counter = 0`: used to generate
+  new TensorHandle / CompletionHandle ids.
+- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0`:
+  the PE-local scratch region (used to allocate output handle addresses for
+  math/dot/composite).
+
+In short, **the first job of TLContext is to fill in the metadata describing
+where this kernel instance runs (sip/cube/pe), at what scale
+(num_programs/num_cubes), and in which mode (runner present or not) it issues
+commands**. At this point no SimPy event exists and no command has been issued.
+
+The first runtime action happens when the kernel function first calls
+`tl.<api>()`. The standard entry behavior of every `tl.*` API is:
+
+1. Call `self._emit_dispatch_overhead()`: if `dispatch_cycles > 0`,
+   immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
+2. API-specific processing (create the TensorHandle, build the command).
+3. `self._emit(cmd)`: in runner mode, hand the cmd to the SimPy side via
+   greenlet.switch; otherwise append it to `self._commands`.
+
+## Context
+
+The `tl.*` surface consists of the methods exposed by `TLContext`, and the
+`tl` parameter a kernel function receives is this object. The contract as seen
+by the user (the bench author):
+
+- which primitives exist
+- which data flow each primitive triggers (DMA / compute / IPCQ /
+  metadata-only)
+- how a TensorHandle's `space` and `addr` are determined
+- how command-list mode and greenlet mode differ
+
+ADR-0014 (PE pipeline) defines the PeCommands that PE_SCHEDULER receives, but
+how `tl.*` emits them exists only as code convention. Likewise, ADR-0020
+(2-pass data execution) mentions the existence of greenlet mode in D3, but the
+signature difference between the runner and non-runner paths (return value
+handling) is not stated at the ADR level. This ADR fills that gap.
+
+## Decision
+
+### D1. The `tl` parameter is a `TLContext` instance
+
+A bench's kernel function follows this signature:
+
+```python
+def _kernel(arg1, arg2, ..., tl, **kwargs):
+    ...
+```
+
+`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name
+mimics real Triton's `triton.language` module; the real Triton module is not
+what gets passed in.
+
+A kernel function is a plain Python function with no `yield` / `async`.
+`tl.*` calls generate SimPy events, but to the caller (the kernel) they look
+like synchronous calls, because in greenlet mode KernelRunner relays between
+SimPy and the kernel (ADR-0020 D3).
+
+### D2. Two execution modes: command-list / greenlet runner
+
+- **command-list mode (`runner is None`)**: `tl.*` calls accumulate PeCommands
+  in the `self._commands` list. DMA / GEMM / Math consume no actual SimPy
+  time, and return values are metadata-only TensorHandles (data=None).
+  PE_SCHEDULER / sim_engine later replay the command sequence in time.
+
+- **greenlet runner mode (`runner is not None`)**: `tl.*` calls go through
+  `self._emit(cmd)`, which hands control to the parent greenlet (SimPy) via
+  `runner.switch_to_simpy(cmd)`. The parent dispatches the cmd to components,
+  consumes SimPy time, and for a DMA read returns the actual numpy data. The
+  kernel receives the result and proceeds to the next line (the data-aware
+  execution model of ADR-0020 D3).
+
+The mode is selected by whether a KernelRunner instance is injected into
+TLContext; the `tl.*` methods are unaware of the difference and behave
+consistently through the `_emit()` helper.
+
+### D3. Primitive categories
+
+#### D3.1. Reference (no DMA, metadata only)
+
+- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: creates only a handle
+  that references HBM data and issues no DMA. Used when the composite
+  scheduler streams per tile (e.g. the b operand of GEMM).
+
+#### D3.2. Data movement (blocking, DMA engine)
+
+- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → result handle.
+  Issues `DmaReadCmd`. In greenlet mode the actual numpy array is attached to
+  the result handle's `.data`; in command-list mode it is a placeholder. The
+  returned handle has `space="hbm"`, `pinned=True`.
+- `tl.store(ptr, handle) -> None`: TCM → HBM. Issues `DmaWriteCmd`. In
+  greenlet mode, `_store.write("hbm", ptr, data)` is called first, and only
+  when `handle.data` is present (visibility = issue time, ADR-0020 D3).
+
+#### D3.3. GEMM / compute (blocking)
+
+- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must be in TCM, with
+  shapes (M,K) × (K,N) → (M,N). Issues `GemmCmd`; the output handle is
+  allocated in PE-local scratch by `_make_compute_out(shape, dtype)`.
+- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
+  acc_dtype=None, tile_shape=None) -> CompletionHandle`: a non-blocking
+  tiled pipeline. Issues `CompositeCmd`. `epilogue` is a list of dicts; each
+  dict has an `"op"` key + op-specific fields + an optional `"scope"`
+  (k_tile / output_tile); an unknown op or a missing field raises ValueError
+  immediately. Synchronize on the returned CompletionHandle with `tl.wait(h)`.
+
+#### D3.4. Math: unary (blocking)
+
+- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
+  `tl.cos(x)`, `tl.sin(x)`: each issues `MathCmd(op=<name>, inputs=(x,), out=)`.
+  `out` is a scratch allocation with the same shape/dtype.
+
+#### D3.5. Math: binary (blocking)
+
+- `tl.maximum(a, b)`, `tl.minimum(a, b)`: via `_binary_math`.
+- `tl.fma(a, b, c)`: `a*b + c`, with 3 inputs.
+- `tl.clamp(x, min, max)`: `MathCmd(op="clamp", inputs=(x, min, max))`.
+- `tl.where(cond, a, b)`: `MathCmd(op="where", inputs=(cond, a, b))`.
+- `tl.softmax(x, axis=-1)`: a single MathCmd(op="softmax"), so timing is
+  accounted for in one step. The Phase 2 DataExecutor expands it into the
+  canonical sequence (x-max → exp → sum → div).
+
+#### D3.6. Reduction (blocking)
+
+- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)`: return an output
+  handle with the size of the given axis reduced to 1. They issue
+  `MathCmd(op=<name>, inputs=(x,), out=, axis=axis)`.
+
+#### D3.7. Index / scalar (PE_CPU, no engine)
+
+- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE index),
+  `axis==1` → cube_id (ADR-0022).
+- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per cube),
+  `axis==1` → num_cubes.
+- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range in TCM.
+  Metadata only; no command is issued.
+- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape, value,
+  dtype="f16") -> TensorHandle`: placeholders in TCM. No command is issued.
+
+#### D3.8. Scalar helpers (no command, no engine)
+
+- `TLContext.cdiv(a, b) -> int` (static): ceiling division
+  `-(-a // b)`. Mimics real Triton's `tl.cdiv`.
+
+#### D3.9. Metadata-only (no compute, no DMA)
+
+- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims of the
+  shape swapped. It shares the same addr/data; no command is issued.
+
+#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
+
+- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
+  dtype="f16", space="tcm") -> None`: blocking send. Accepts either the
+  handle form or the raw-address form. Issues `IpcqSendCmd`. When a snapshot
+  of the handle's `.data` travels with the command, it avoids a race on the
+  data the recv side observes.
+- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
+  dst_space=None) -> TensorHandle`: blocking recv. If both `dst_addr` and
+  `dst_space` are given it runs in "copy_to_dst" mode, otherwise in
+  "return_slot" mode. In greenlet mode the actual data is attached to the
+  handle's `.data`.
+- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
+  **DIAGNOSTIC ONLY**. The blocking recv synchronization still applies, but
+  the slot-read latency (slot-IO + PE↔bank fabric drain) is skipped. Intended
+  for an apples-to-apples comparison with `tl.store` in the pe2pe overview
+  plot. Production kernels must not use it; it is isolated in a separate
+  `consume=False` command branch, so an accidental flag cannot trigger it.
+- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
+  recv. Returns a `RecvFuture`; collect the result later with `tl.wait(future)`.
+
+#### D3.11. Composite + control
+
+- `tl.composite(...)`: described in D3.3.
+- `tl.wait(handle=None)`: waits on a `CompletionHandle` (composite), a
+  `RecvFuture` (async recv), or `None` (all pending composites).
+- `tl.cycles(n)`: explicitly declares PE_CPU scalar execution overhead.
+  Issues `PeCpuOverheadCmd(cycles=n)`.
+
+### D4. TensorHandle arithmetic operators: thread-local TLContext
+
+When the `tl_context.py` module is loaded, `_enable_tensor_ops()` is called
+and patches `TensorHandle.__add__`, `__sub__`, `__mul__`, and `__truediv__`.
+Each operator calls `_binary_math` on the active TLContext stored in the
+thread-local `_ctx` (a module variable).
+
+Inside a kernel, `c = a + b` is therefore equivalent to issuing
+`MathCmd(op="add", inputs=(a,b), out=)` and returning a new TensorHandle.
+
+Active TLContext management:
+
+- `TLContext._set_active(ctx)`: sets the active ctx for the current
+  thread/greenlet.
+- `TLContext._get_active()`: looks it up (RuntimeError if none).
+- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets the active
+  ctx on entry, runs the kernel, and restores it to None on exit.
+
+On every cmd dispatch, `KernelRunner`'s `_switch_kernel` calls
+`_set_active(tl)` directly, so the active ctx is restored even if another PE
+runner in the same thread overwrote it.
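
The active-context pattern can be illustrated with a toy version. Everything here is a simplified stand-in: `Ctx` and `Handle` are hypothetical classes rather than `TLContext` / `TensorHandle`, commands are plain tuples instead of `MathCmd`, and passing the context to the kernel as the `tl` keyword is an assumption about how `run_kernel` wires things up.

```python
import threading

# Thread-local slot holding the active context.
_state = threading.local()


def set_active(ctx):
    _state.ctx = ctx


def get_active():
    ctx = getattr(_state, "ctx", None)
    if ctx is None:
        raise RuntimeError("no active context")
    return ctx


class Handle:
    """Toy tensor handle: only carries a shape."""

    def __init__(self, shape):
        self.shape = shape

    def __add__(self, other):
        # The operator has no context argument, so it asks for the active one.
        return get_active().binary_math("add", self, other)


class Ctx:
    """Toy context that records commands instead of issuing them."""

    def __init__(self):
        self.commands = []

    def binary_math(self, op, a, b):
        out = Handle(a.shape)
        self.commands.append((op, a, b, out))
        return out


def run_kernel(kernel_fn, ctx, *args, **kwargs):
    """Set the active context, run the kernel, then clear the slot."""
    set_active(ctx)
    try:
        return kernel_fn(*args, tl=ctx, **kwargs)
    finally:
        set_active(None)


def _kernel(a, b, tl):
    return a + b  # recorded as an "add" command on the active context


ctx = Ctx()
result = run_kernel(_kernel, ctx, Handle((4, 4)), Handle((4, 4)))
print(ctx.commands[0][0], result.shape)
```

The `try/finally` keeps the slot clean even when the kernel raises, which is one way to honor the "restore to None on exit" behavior described above.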
+
+### D5. Scratch allocator: compute output handles
+
+Ops that produce a result, such as `tl.dot`, `tl.exp`, and `tl.add`
+(TensorHandle `__add__`), call `_make_compute_out(shape, dtype)` to allocate
+a 16-byte aligned scratch address. The address is issued with `space="tcm"`,
+and later `tl.send` / `tl.store` calls can use the handle as a source.
+
+If `_scratch_base == 0` (e.g. in command-list mode), the allocated address is
+returned as 0 and the handle cannot be used as a send/store source (in that
+case only handles obtained from `tl.load` can be sources).
+
+If the cursor exceeds `_scratch_size` (default 1 MiB), a RuntimeError is
+raised. The cursor must be reset to 0 at the start of every kernel
+invocation; in the current code this happens naturally because KernelRunner
+creates a new TLContext each time.
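
One way to picture the allocator is a bump allocator with 16-byte alignment. The sketch below is an assumption-laden illustration, not `_make_compute_out` itself: the class name, the byte-size argument (the real helper takes `shape, dtype`), and the exact overflow check are all hypothetical. It also assumes the base address is itself 16-byte aligned.

```python
class ScratchAllocator:
    """Toy bump allocator over a PE-local scratch region."""

    ALIGN = 16

    def __init__(self, base: int, size: int = 1 << 20):
        self.base = base      # 0 means "no scratch region configured"
        self.size = size      # 1 MiB by default, as in the ADR
        self.cursor = 0       # offset of the next free byte

    def alloc(self, nbytes: int) -> int:
        if self.base == 0:
            # No scratch region: address 0 marks the handle as unusable
            # as a send/store source.
            return 0
        # Round the cursor up to the next multiple of ALIGN.
        start = (self.cursor + self.ALIGN - 1) & ~(self.ALIGN - 1)
        end = start + nbytes
        if end > self.size:
            raise RuntimeError("scratch region exhausted")
        self.cursor = end
        return self.base + start


alloc = ScratchAllocator(base=0x1000)
first = alloc.alloc(10)    # occupies offsets 0..9
second = alloc.alloc(4)    # starts at the next 16-byte boundary
print(hex(first), hex(second))
```

Creating a fresh allocator per kernel invocation resets the cursor for free, which mirrors the "new TLContext each time" behavior noted above.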
+
+### D6. Dispatch overhead: `PeCpuOverheadCmd(dispatch_cycles)`
+
+`_emit_dispatch_overhead()` is called at the entry of every non-metadata
+`tl.*` call and issues `PeCpuOverheadCmd(dispatch_cycles)` when
+`dispatch_cycles > 0`. It models the cycle cost PE_CPU spends on command
+dispatch itself.
+
+Defaults:
+
+- Default of the `dispatch_cycles` parameter in `TLContext.__init__`: 1 cycle.
+- TLContext created by `KernelRunner`: 0 cycles (greenlet mode accounts for
+  cycles separately, in the spirit of ADR-0020 D3).
+
+### D7. Kernel registry (`triton_emu/registry.py`)
+
+A separate `_kernels: dict[str, Callable]` holds the kernel name → function
+mapping:
+
+- `register_kernel(name, fn)`: ValueError on duplicate registration.
+- `get_kernel(name)`: KeyError if not registered.
+- `clear_registry()`: test-only.
+
+On every call, `RuntimeContext.launch(kernel_name, kernel_fn, *args)`
+idempotently overwrites `_kernels[kernel_name] = kernel_fn` (last call wins).
+This is consistent with the launch behavior in ADR-0045 D8.
+
+PE_CPU looks up the kernel function in the registry by `KernelRef.name` and
+then runs it with KernelRunner.
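
The two registration paths behave differently on a name collision, which a short sketch makes concrete. The three function names and the `_kernels` dict come from D7; `launch_register` is a hypothetical stand-in for the overwrite that `RuntimeContext.launch` is described as performing, and the error messages are made up.

```python
from typing import Callable, Dict

_kernels: Dict[str, Callable] = {}


def register_kernel(name: str, fn: Callable) -> None:
    """Strict registration: a duplicate name is an error."""
    if name in _kernels:
        raise ValueError(f"kernel already registered: {name}")
    _kernels[name] = fn


def get_kernel(name: str) -> Callable:
    """Lookup; raises KeyError for an unknown name."""
    return _kernels[name]


def clear_registry() -> None:
    """Test-only reset."""
    _kernels.clear()


def launch_register(name: str, fn: Callable) -> None:
    """Launch-path registration: overwrite silently, last call wins."""
    _kernels[name] = fn


def kernel_v1(tl):
    return "v1"


def kernel_v2(tl):
    return "v2"


register_kernel("demo", kernel_v1)
launch_register("demo", kernel_v2)   # allowed: overwrites the entry
print(get_kernel("demo")(None))
```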
+
+## Alternatives Considered
+
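+### A1. Fold tl.* into ADR-0014 / ADR-0020
+
+Rejected. ADR-0014 covers the PE pipeline (how sim_engine consumes
+PeCommands), and ADR-0020 covers 2-pass execution (Phase 1 timing / Phase 2
+data). `tl.*` is the API surface kernel authors meet, so keeping it separate
+is better for searchability and onboarding.
+
+### A2. Deprecate command-list mode
+
+Rejected (for now). Command-list mode is lightweight for simple unit tests
+and kernel verification. It is kept as an entry point for inspecting
+PeCommand sequences without a greenlet dependency. When the semantics
+specific to greenlet mode (real data, Phase 2) are needed, callers opt in
+explicitly through the mode selection in D2.
+
+### A3. Remove the TensorHandle arithmetic operators
+
+Rejected. They exist to mimic the readability of real Triton kernel code
+(e.g. `c = a + b`), and the thread-local active ctx pattern works cleanly. An
+explicit `tl.add(a, b)` is also exposed in D3.5, so the function form can
+replace the operators if they are confusing.
+
+### A4. Expand softmax into an explicit sequence (max → exp → sum → div)
+
+Partially adopted. `tl.softmax` handles timing accounting in one step as a
+single `MathCmd(op="softmax")` (D3.5). The real data expansion is done by the
+Phase 2 DataExecutor, which unrolls it into the canonical sequence. In other
+words, the time model is atomic and the data model is expanded; the two
+concerns are deliberately separated.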
+
+## Consequences
+
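+- Every `tl.*` primitive a bench author meets is classified and defined in
+  one ADR. Paired with the host-side surface of ADR-0045 D8 (torch.empty
+  etc.), this completes the authoring guide for both "inside and outside the
+  kernel".
+- The difference between command-list and greenlet modes is stated in D2, so
+  a new `tl.*` primitive is automatically compatible with both as long as it
+  follows the `_emit()` pattern.
+- The thread-local active ctx pattern (D4) is justified at the ADR level,
+  making clear who is responsible for the reset in future multi-PE
+  same-thread execution (`_switch_kernel` restores the active ctx on cmd
+  dispatch, as part of the KernelRunner.run contract).
+- The diagnostic-only isolation of `tl.recv_no_consume` (D3.10) is fixed in
+  the ADR, making clear which layer prevents accidental use in production
+  kernels.
+- The registry (D7) is split into its own D item, specifying the behavior for
+  kernel name collisions and dynamic re-registration.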
@@ -0,0 +1,243 @@
+# ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
+
+## Status
+
+Accepted (2026-05-22).
+
+Specifies what `AhbmCCLBackend` + `DistributedContext` in
+`runtime_api/distributed.py`, i.e. the
+`torch.distributed.init_process_group(backend="ahbm")` entry point, actually
+install and with what semantics they implement `all_reduce`/`barrier`/
+`get_rank` and the rest. ADR-0023 D11 mentions the "torch.distributed
+compatibility" intent, but **the behavior model of the backend itself** was
+missing at the ADR level.
+
+## First action (the very first thing that happens)
+
+`RuntimeContext.__post_init__` automatically creates a `DistributedContext()`
+instance and attaches it to `self.distributed`. The first things done at that
+point are:
+
+1. Initialize `self._backend: AhbmCCLBackend | None = None` (not yet
+   initialized).
+2. Initialize the greenlet-local rank registry with
+   `self._rank_by_greenlet: dict = {}` (ADR-0024 D2).
+3. The caller (RuntimeContext) plants a back-reference with
+   `dc._ctx_ref = self`, so that `init_process_group` can later reach
+   `ctx.engine` / `ctx.spec` / `ctx.launch`.
+
+In short, **the first job of DistributedContext is to attach itself to the
+RuntimeContext together with a back-reference and leave the backend slot
+empty**. The actual backend installation (IPCQ install, world_size
+resolution, algorithm module load) happens only when user code calls
+`torch.distributed.init_process_group(backend="ahbm")`.
+
+At that point, the first things `init_process_group` does are:
+
+1. If `backend != "ahbm"`, raise `ValueError("Unsupported backend ...")`
+   immediately.
+2. If `getattr(self, "_ctx_ref", None)` is None, raise
+   `RuntimeError("DistributedContext not bound to a RuntimeContext")`.
+3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)`: inside this constructor
+   the ccl.yaml load, algorithm module import, world_size resolution, SFR
+   configuration, and IPCQ install all take place.
+4. `self._backend._dist_ctx = self`: lets the backend reach back into
+   `_rank_by_greenlet`.
+
+## Context
+
+The purpose of `AhbmCCLBackend` is to let PyTorch DDP collective calls
+(`init_process_group`, `all_reduce`, etc.) be used as-is, so that bench code
+looks the same as a "real DDP training script" (aligned with the launcher
+model of ADR-0024 + ADR-0027).
+
+This backend is responsible for:
+
+- **Installing the IPCQ neighbor table once** at `init_process_group` time
+  (similar to real NCCL communicator creation).
+- **Launching the configured algorithm's kernel function via
+  `ctx.launch(...)`** when `all_reduce(tensor, op="sum")` is called.
+- Answering `get_world_size` / `get_rank` consistently from the
+  greenlet-local rank registry and ccl.yaml/topology.
+
+ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) cover parts of
+this, but **the scope of responsibility and decision order of
+`AhbmCCLBackend` itself** are not stated anywhere. This ADR fills that gap.
+
+## Decision
+
+### D1. The backend is created only at `init_process_group(backend="ahbm")` time
+
+`DistributedContext` starts with `_backend = None` at `__init__` time. The
+backend object does not exist until the user calls
+`dist.init_process_group(backend="ahbm")`, and if any other API
+(`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) is called while
+the backend is None, it raises
+`RuntimeError("Default process group has not been initialized...")` (via the
+`_ensure_initialized` helper).
+
+`backend != "ahbm"` raises `ValueError` immediately. Other backend names
+(nccl, gloo, etc.) are not recognized.
+
+### D2. world_size resolution priority: algorithm > defaults > topology
+
+Decision order of `AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):
+
+1. If the algorithm entry in `ccl.yaml` has `world_size`, use that value.
+2. If `defaults.world_size` exists, use that value.
+3. If neither exists, use `spec.system.sips.count` (= the number of SIPs in
+   the topology).
+
+The default semantics are **rank = SIP** (ADR-0024). Cube/PE-level
+parallelism is expressed with DPPolicy inside each rank and does not affect
+world_size. An explicit world_size override in `ccl.yaml` is honored as-is
+for the legacy "rank = flat PE index" test path.
+
+The user arguments of `init_process_group(world_size=..., rank=...)` are
+**accepted but ignored** (they carry the same meaning as real PyTorch's
+`RANK` / `WORLD_SIZE` env vars).
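
The three-level priority can be written as a small pure function. This is a sketch for illustration: the function signature and the plain-dict inputs are assumptions, standing in for the merged `ccl.yaml` config and `spec.system.sips.count` that `_resolve_world_size` is described as reading.

```python
from typing import Mapping


def resolve_world_size(algo_cfg: Mapping, defaults: Mapping, sip_count: int) -> int:
    """Pick world_size by priority: algorithm entry > defaults > topology."""
    if algo_cfg.get("world_size") is not None:
        return int(algo_cfg["world_size"])
    if defaults.get("world_size") is not None:
        return int(defaults["world_size"])
    # Neither override is present: rank = SIP, so use the SIP count.
    return sip_count


# Hypothetical configs: an algorithm-level override wins over everything.
print(resolve_world_size({"world_size": 8}, {"world_size": 4}, sip_count=2))
# Only the defaults block sets a value.
print(resolve_world_size({}, {"world_size": 4}, sip_count=2))
# No override at all: fall back to the topology.
print(resolve_world_size({}, {}, sip_count=2))
```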
+
+### D3. The five installation steps `init_process_group` performs immediately
+
+The following run sequentially inside `AhbmCCLBackend.__init__`:
+
+1. **Load ccl.yaml**: `kernbench.ccl.install.load_ccl_config()` →
+   `resolve_algorithm_config(_cfg_all)` produces the merged config of
+   `defaults.algorithm` (or the algorithm the user specified).
+2. **Import the algorithm module**:
+   `importlib.import_module(self._merged["module"])`. The module must expose
+   a `kernel` function, a `kernel_args(world_size, n_elem, cube_w, cube_h)`
+   helper, and an optional `TOPO_NAME_TO_KIND` mapping.
+3. **Resolve world_size** (D2).
+4. **Collect topology metadata**: from `spec`, read `n_sips`, `sip_topo`
+   (default `ring_1d`), `cube_w`/`cube_h`, and `sips.w`/`sips.h`. If the SIP
+   topology is not ring_1d, `_sip_topo_w/h` is derived from the explicit
+   `w`/`h` or from a square root (guaranteeing `w*h == n_sips`). A mismatch
+   raises `ValueError`.
+5. **Install SFR + IPCQ**: call
+   `kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine, spec,
+   self._merged)`. This function pushes the IPCQ neighbor table to pe0 of
+   every SIP/cube (corresponding to the one-time setup of a real NCCL
+   communicator).
+
+If this order changes (e.g. the algorithm module load fails before the SFR
+step), a partially initialized state can arise. D3 is therefore treated as
+one atomic five-step sequence: on failure the backend remains uninstalled.
+
+### D4. Greenlet-local rank registration (ADR-0024 D2)
+
+`DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps each spawned
+worker greenlet to a rank. When the bench launcher (e.g.
+`torch.multiprocessing.spawn`) starts a worker, it registers it by calling
+`dc._bind_rank(g, rank)`.
+
+`get_rank()` looks up the greenlet returned by `getcurrent()`. An
+unregistered greenlet falls back to 0, preserving single-driver / test
+compatibility.
+
+The backend obtains the current greenlet's rank during `all_reduce` through
+`_dist_ctx._rank_by_greenlet` (D5).
+
+### D5. `all_reduce(tensor, op="sum")` behavior
+
+Validation:
+
+- `op != "sum"` → `NotImplementedError`. The current kernels implement add
+  reduction only.
+- `tensor._handle is None` → `RuntimeError("not deployed")`.
+- `tensor._handle.shards` is empty → `RuntimeError("no shards")`.
+
+Preparation:
+
+- `n_elem = shards[0].nbytes // tensor.itemsize`: the element count of a
+  single shard.
+- `kernel_fn = self._algo_module.kernel`: the entry function of the algorithm
+  module imported in D3.
+- Effective cube dims: if the first SIP has 1 cube, treat it as scalar with
+  (1,1); otherwise use the topology's `cube_w`/`cube_h`. This naturally
+  absorbs the case where TP uses only some cubes.
+- `kernel_args = self._algo_module.kernel_args(world_size, n_elem, cube_w,
+  cube_h)`: the algorithm decides the argument set passed to its own kernel.
+
+Dispatch:
+
+- Look up the current greenlet's rank with `_rank_by_greenlet.get(g, 0)`.
+- Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`.
+- `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor, *kernel_args,
+  *extra_args, _defer_wait=True)`: `_defer_wait=True` delegates the
+  collective drain to the main scheduler (ADR-0027 D0.4).
+
+Drain:
+
+- If a parent greenlet is alive (multi-greenlet mode), enqueue into
+  `_pending_collective_handles` and switch to the parent. The main scheduler
+  drains everything at once after all ranks have launched.
+- In single-driver mode, drain inline immediately with
+  `for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
+
+### D6. `barrier()` is a no-op (single-driver model)
+
+kernbench handles all ranks as greenlets inside a single Python process.
+Since no situation requires inter-process synchronization, `barrier()` is
+callable but performs no synchronization at all. It is kept for API
+compatibility with real PyTorch DDP (so callers do not get a
+NotImplementedError).
+
+If a multi-process kernbench (e.g. an independent SimPy event loop per
+process) is introduced in the future, a new ADR superseding D6 will be needed.
+
+### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`
+
+- `get_rank()` (D4): the bound rank of the current greenlet. Unregistered
+  greenlets get 0.
+- `get_world_size()` (D2): the world_size the backend resolved in D3.
+- `get_backend()`: always the string `"ahbm"`. If the backend object does not
+  exist, `_ensure_initialized` raises RuntimeError.
+
+Difference from real PyTorch:
+
+- Real PyTorch's `get_rank()` is a process-global value, whereas kernbench's
+  is greenlet-local. Called inside a spawned worker it returns the rank;
+  called from the main thread it returns 0. Bench authors should expect a
+  meaningful rank only inside the worker function.
+
+### D8. 지원하는 API 표면 (final)
+
+`DistributedContext` 가 노출하는 API:
+
+- `init_process_group(backend="ahbm", world_size=None, rank=None, **kwargs)`
+- `is_initialized() -> bool`
+- `get_world_size() -> int`
+- `get_rank() -> int`
+- `get_backend() -> str`
+- `all_reduce(tensor, op="sum") -> None`
+- `barrier() -> None`
+- (internal) `_bind_rank(g, rank)`
+
+The remaining PyTorch distributed APIs (broadcast, reduce, all_gather,
+gather, scatter, send/recv, and so on) are **not implemented yet**. At the
+kernel level they can be expressed directly with `tl.send`/`tl.recv`
+(ADR-0046 D3.10), but they are not exposed on the dist.* surface. When an
+additional collective is needed, extend D8 by adding a pair: a separate
+algorithm module plus a matching `DistributedContext` method.
+
+## Alternatives Considered
+
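+### A1. Create the backend eagerly in `RuntimeContext.__init__`
+
+Rejected. If ccl.yaml is missing or the algorithm module cannot be
+imported, constructing RuntimeContext itself would fail even when the bench
+does not use any distributed feature. Installing "only at call time" (D1)
+is the correct lazy semantics.
+
+### A2. Always derive world_size from the topology (no override)
+
+Rejected. The "explicit override" path of ADR-0024 D1 is in use by legacy
+tests. It is kept for diagnostic scenarios that need to define PE-level
+ranks separately within a single SIP.
+
+### A3. Treat `op != "sum"` as a silent fallback
+
+Rejected. If the user intended `op="prod"` / `"max"` / `"avg"` and a sum
+ran silently, validating results would be very hard. An explicit
+`NotImplementedError` is safer.
+
+### A4. Implement `barrier` as a SimPy event
+
+Rejected (for now). In the single-driver model cross-process
+synchronization has no meaning, so a no-op is semantically exact. A SimPy
+fake barrier would only add code complexity without adding meaning.
+Re-evaluate when a multi-process kernbench is introduced.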
+
+## Consequences
+
+- The 4-step install (D3) of
+  `torch.distributed.init_process_group(backend="ahbm")` is fixed at the ADR
+  level, so it is clear where future collective algorithms should hook in.
+- The precedence in D2 (algorithm > defaults > topology) is explicit, so
+  the impact of a ccl.yaml change can be gauged quickly.
+- The barrier no-op decision in D6 is fixed at the ADR level, making it
+  clear that introducing a multi-process kernbench requires a superseding
+  ADR.
+- The list of unsupported APIs in D8 is explicit, giving a clear basis for
+  rejection when a user tries to call `dist.broadcast(...)`.
@@ -0,0 +1,262 @@
+# ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies the free-list algorithm, page alignment, and coalescing
+rules used by `_FreeList` / `PEMemAllocator` in
+`policy/address/allocator.py` and by `VirtualAllocator` in
+`va_allocator.py`. ADR-0001 (PhysAddr layout) and ADR-0011 (PA/VA/LA model)
+define the address scheme, but the **allocation algorithm** had no ADR of
+its own.
+
+## First action
+
+### `_FreeList(capacity)`
+
+On construction it initializes `self._capacity = capacity`, `self._used =
+0`, and `self._free = [(0, capacity)]`. Its first action is to **set up the
+whole region as a single free block**: the one tuple `(offset=0,
+size=capacity)` is the only element of the free list.
+
+### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
+
+On construction it creates two `_FreeList` instances:
+
+- `self._hbm = _FreeList(cfg.hbm_slice_bytes)`: sized to the byte size of
+  the HBM slice owned by this PE (`hbm_bytes_per_cube //
+  hbm_slices_per_cube`).
+- `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)`: sized to
+  `tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler
+  reservation is carved out in advance).
+
+So the first action of PEMemAllocator is to **set up this PE's HBM slice
+and its user TCM region as one free block each**.
+
+### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
+
+On construction it sets `self._va_base = va_base`, `self._va_size =
+va_size`, `self._page_size = page_size`, `self._used = 0`, and `self._free
+= [(va_base, va_size)]`. Its first action is to **set up a single block
+covering `va_size` bytes starting at the VA base and to record
+`page_size`**.
+
+## Context
+
+`runtime_api/context.py::_ensure_allocators` builds the allocator set in
+the following steps:
+
+1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`, the
+   per-target_device SIP range, and related values from the spec.
+2. Pack all parameters into a frozen `AddressConfig`.
+3. Create one `PEMemAllocator(sip, cube, pe, cfg)` instance for every
+   combination of target SIP range × cube × PE.
+4. Create one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64 GiB,
+   page_size=pe_mmu.page_size)`.
+
+Responsibilities of the allocators:
+
+- **PEMemAllocator**: allocates PA space in the PE-local HBM slice / TCM
+  (including PhysAddr encoding).
+- **VirtualAllocator**: allocates the device-wide VA space with page
+  alignment. Afterwards `RuntimeContext._create_tensor` pushes the VA → PA
+  mapping to the fabric as a `MmuMapMsg`.
+
+These algorithms:
+
+- are simple **first-fit**;
+- keep the free-block list **sorted by start offset**;
+- **coalesce with both adjacent blocks** on `free()`.
+
+The rationale for these decisions is recorded nowhere, so a reference is
+needed to answer future questions such as "why not best-fit?", "why not a
+buddy allocator?", and "why is a partial-overlap free silently allowed?".
+This ADR provides that reference.
+
+## Decision
+
+### D1. `_FreeList`: offset-based first-fit + coalescing
+
+`policy/address/allocator.py::_FreeList`:
+
+- Internal representation: `list[tuple[int, int]]` = `[(start_offset,
+  size), ...]`, a list of free blocks sorted by start offset.
+- `alloc(nbytes)`:
+  1. Walk the free list from the front (first-fit).
+  2. Carve the allocation from the front of the first block with `size >=
+     nbytes`.
+  3. On an exact match remove the whole block; otherwise shrink it to
+     `(start+nbytes, size-nbytes)`.
+  4. `_used += nbytes`, and return the carved `start`.
+  5. If no block fits, raise `AllocationError("overflow ... largest free
+     block ...")`.
+- `free(offset, nbytes)`:
+  1. `_used -= nbytes`.
+  2. Find the insertion position with `bisect_left(self._free, (offset,))`.
+  3. Absorb the preceding block if it is adjacent (`prev_start + prev_size
+     == offset`).
+  4. Absorb the following block if it is adjacent (`offset+nbytes ==
+     next_start`).
+  5. Insert the coalesced range at its sorted position.
+
+This algorithm is weaker against fragmentation than best-fit or buddy
+allocation, but the design assumption is that it suffices for this
+simulator's workloads, whose deploy/free pattern is almost stack-like. If
+the workload changes, D1 is a candidate for being superseded.
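The `alloc`/`free` steps above can be sketched as follows. This is a minimal illustration written from the description in D1, not the code in `policy/address/allocator.py`; the class name `FreeListSketch` and the exact error-message wording are assumptions made for the example.

```python
from bisect import bisect_left


class AllocationError(Exception):
    """Raised when no free block is large enough (name taken from the ADR)."""


class FreeListSketch:
    """Offset-sorted first-fit free list that coalesces on free()."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._used = 0
        self._free = [(0, capacity)]  # sorted list of (start, size)

    def alloc(self, nbytes):
        # First-fit: take the front of the first block that is large enough.
        for i, (start, size) in enumerate(self._free):
            if size >= nbytes:
                if size == nbytes:
                    del self._free[i]
                else:
                    self._free[i] = (start + nbytes, size - nbytes)
                self._used += nbytes
                return start
        largest = max((size for _, size in self._free), default=0)
        raise AllocationError(
            f"overflow: need {nbytes}, largest free block {largest}")

    def free(self, offset, nbytes):
        # As in D2, the caller is trusted: no overlap or double-free checks.
        self._used -= nbytes
        i = bisect_left(self._free, (offset,))
        start, size = offset, nbytes
        if i > 0:  # absorb the preceding block if it ends at `offset`
            p_start, p_size = self._free[i - 1]
            if p_start + p_size == start:
                start, size = p_start, p_size + size
                del self._free[i - 1]
                i -= 1
        if i < len(self._free):  # absorb the following block if adjacent
            n_start, n_size = self._free[i]
            if start + size == n_start:
                size += n_size
                del self._free[i]
        self._free.insert(i, (start, size))


fl = FreeListSketch(100)
a, b, c = fl.alloc(30), fl.alloc(30), fl.alloc(40)
fl.free(a, 30)
fl.free(c, 40)
fl.free(b, 30)  # the middle block merges with both neighbours
```

Freeing the middle block last exercises both coalescing branches, which is why the list collapses back to a single `(0, capacity)` entry.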
+
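+### D2. Partial-overlap frees are **not checked**
+
+`_FreeList.free(offset, nbytes)` trusts the caller to pass the exact
+(offset, nbytes). It does not verify:
+
+- that the range was actually allocated;
+- that the range does not overlap another allocated region.
+
+Reason: in the simulator context the caller always stores the return value
+of `alloc()` as-is and hands it back to `free()`; it is not external user
+input. Adding a safety check would cost O(N) on every free and affect
+simulation wall-clock time.
+
+If this trust model breaks (for example, a code path where two tensors
+point to the same PA is introduced), revisit it at the ADR level
+immediately.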
+
+### D3. `PEMemAllocator`: separate HBM and TCM channels
+
+`PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList`s:
+
+- `_hbm`: `cfg.hbm_slice_bytes` in size.
+- `_tcm`: `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
+  tcm_scheduler_reserved_bytes`) in size.
+
+`alloc_hbm(nbytes) -> PhysAddr`:
+
+- Obtain the offset with `_hbm.alloc(nbytes)`.
+- Encode the PA with `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id,
+  pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`.
+- On failure, raise `AllocationError("HBM overflow ...")`.
+
+`free_hbm(pa, nbytes)`:
+
+- Recover the PE-local offset as `pa.hbm_offset - pe_id *
+  cfg.hbm_slice_bytes`.
+- `_hbm.free(offset, nbytes)`.
+
+`alloc_tcm(nbytes) -> PhysAddr`: encodes analogously with
+`PhysAddr.pe_tcm_addr`.
+
+`free_tcm(pa, nbytes)`: uses `pa.sub_offset` as-is (for TCM the PE-local
+offset is the sub_offset).
+
+The allocator is unaware of the scheduler-reserved TCM region
+(`cfg.tcm_scheduler_reserved_bytes`); it has already been subtracted from
+the capacity of `_tcm`. This is consistent with the PE_SCHEDULER internal
+buffer reservation in ADR-0014.
+
+### D4. `VirtualAllocator`: page-aligned first-fit + coalescing
+
+`policy/address/va_allocator.py::VirtualAllocator`:
+
+- Internal representation: the same sorted `list[tuple[int, int]]` as
+  `_FreeList`. Initially `[(va_base, va_size)]`.
+- `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
+- `alloc(nbytes) -> int`:
+  1. `aligned = _align_up(nbytes)`.
+  2. Search first-fit for a block with `size >= aligned`.
+  3. Carve `aligned` bytes from the front of the block; remove it on an
+     exact match.
+  4. `_used += aligned`. Return the block `start` (= the aligned VA).
+  5. On failure, raise `VaAllocationError`.
+- `free(va, nbytes)`: frees in units of `_align_up(nbytes)`, using the
+  same coalescing algorithm as `_FreeList`.
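As a small worked example of the `_align_up` rule, the sketch below rounds a request up to a page multiple using integer arithmetic only. The free function `align_up` and the constant names are hypothetical stand-ins for the method described above; the two page sizes are the 4 KiB and 2 MiB values mentioned in this ADR.

```python
def align_up(nbytes, page_size):
    """Round nbytes up to the next multiple of page_size (integer ceil)."""
    return -(-nbytes // page_size) * page_size


PAGE_4K = 4096
PAGE_2M = 2 * 1024 * 1024

# The same 6 KiB request consumes very different amounts of VA space
# depending on which page size the allocator was built with.
small = align_up(6 * 1024, PAGE_4K)
large = align_up(6 * 1024, PAGE_2M)
```

Because `free(va, nbytes)` applies the same rounding, passing the original unaligned `nbytes` back is enough to release the whole aligned range.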
+
+The actual value of `page_size` has different defaults in two places:
+
+- The parameter default of `VirtualAllocator.__init__`: `2 MiB`. Tests
+  that call it directly receive this value.
+- When `RuntimeContext._ensure_allocators` instantiates it:
+  `pe_mmu.attrs.get("page_size", 4096)`, that is, the value of
+  `pe_mmu.attrs.page_size` in `topology.yaml` if present, otherwise a
+  4 KiB fallback.
+
+Why the two defaults differ: the standalone default of VirtualAllocator
+matches the PE_MMU stopgap default (2 MiB) of ADR-0039, which keeps direct
+tests natural, while the 4 KiB context fallback is a safe minimum page
+when the topology does not set one. The real usage path is always the
+latter (because `_ensure_allocators` does the instantiation), and when
+`page_size` is specified in `topology.yaml` that value flows consistently
+to both sides (MMU + VA allocator).
+
+If this agreement breaks (for example, VirtualAllocator is instantiated
+with a page_size different from PE_MMU's), MMU `map()` falls into the
+sub-page region mode (ADR-0039 D3).
+
+Default VA range: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64 GiB`.
+These values are hardcoded in `_ensure_allocators` and carry no direct
+meaning in the VA model of ADR-0011; they simply reserve a device-wide
+address space large enough not to collide with host code.
+
+### D5. Lifecycle of allocator instances
+
+- `RuntimeContext._ensure_allocators` is called lazily (at the first call
+  to `_create_tensor`).
+- Once created, the allocator dict (`self._allocators`) is reused for the
+  lifetime of the RuntimeContext. A second deploy in the same process does
+  not create new objects.
+- `RuntimeContext.cleanup()` calls `_free_tensor()` on every living tensor
+  → MMU unmap + `va_allocator.free` + `pemem_allocator.free_hbm` restore
+  the free lists to their original state. When the next RuntimeContext
+  builds its allocators, it starts from the initial state.
+
+Because allocator state is not shared between RuntimeContexts, deploy →
+cleanup → deploy is deterministic across consecutive runs in a single
+process.
+
+### D6. Allocator failures raise (no silent OOM)
+
+Both `_FreeList.alloc` and `VirtualAllocator.alloc` raise `AllocationError`
+/ `VaAllocationError` when no sufficient free block exists. The message
+includes "required size + largest available block", so fragmentation can
+be told apart from true OOM.
+
+Silent fallback (for example, allocating only as much as the largest
+block) is strictly forbidden: if a partially allocated tensor enters the
+SimPy stage, routing and DMA see a wrong PA and simulation accuracy
+breaks.
+
+### D7. 1:1 correspondence between address spaces and allocators
+
+Physical address space separation is expressed by the PhysAddr sub-unit
+(ADR-0001 D2.3), and each sub-unit gets its own allocator instance:
+
+- HBM slice → `PEMemAllocator._hbm`.
+- PE TCM → `PEMemAllocator._tcm`.
+- (currently unused) M_CPU local memory, CUBE SRAM → would need separate
+  allocators. The current implementation still handles them as IPCQ-only
+  slots (ADR-0023 D9.7) and does not share the PA space, so no separate
+  free list exists.
+
+When a cube-level SRAM allocator becomes necessary, add a
+`_FreeList(cfg.sram_bytes_per_cube)` instance per cube
+(`cfg.sram_bytes_per_cube` is already defined in `AddressConfig`, so the
+data model is ready).
+
+## Alternatives Considered
+
+### A1. best-fit / buddy allocator
+
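+Rejected (for now). The workload's alloc/free pattern is stack-like (deploy
+order = free order), so first-fit + coalescing keeps fragmentation under
+control. If long-running fragmentation is observed in LLM kernel sweeps, a
+separate ADR will replace it with a buddy allocator.
+
+### A2. Add partial-overlap free validation
+
+Rejected, given the trust model of D2 and the O(N) check cost. However, an
+option that enables it in a debug mode (for example via a `KERNBENCH_DEBUG`
+env var) remains possible as follow-up work.
+
+### A3. A unified allocator for VA and PA
+
+Rejected. The VA space (64 GiB, device-wide) and the PA space (~6 GiB per
+slice) differ in meaning: VA is the host kernel's view, PA is the device
+sub-unit's view. To stay consistent with the VA model of ADR-0011 (the MMU
+maps between the two), the allocators are kept separate as well.
+
+### A4. Multi-tier page_size support (large page + small page)
+
+Rejected (for now). A single page_size (currently 2 MiB) fits the tensor
+sizes of LLM kernels (a few MiB to a few GiB), and small mappings are
+absorbed by the sub-page region of ADR-0039 D3 when needed. Multi-tier
+pages would require extending the MMU model itself, so they are a
+candidate for a separate ADR.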
+
+## Consequences
+
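+- The allocator algorithms are fixed at the ADR level (D1, D3, D4), so when
+  a fragmentation issue appears in a new simulation scenario it is clear
+  that first-fit + coalescing is in use.
+- The trust model of D2 is explicit, so if a path that takes alloc/free
+  directly from user input is introduced, the need to supersede this ADR is
+  recognized early.
+- The per-sub-unit 1:1 allocator correspondence of D7 is explicit, so it
+  is clear where to add a free list when separate M_CPU/SRAM regions
+  become necessary.
+- D4 records that the page_size of `VirtualAllocator` must match the
+  PE_MMU setting, so the interaction with the ADR-0039 stopgap behavior can
+  be gauged quickly when page_size in topology.yaml changes.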
@@ -0,0 +1,231 @@
+# ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies the traffic-pattern catalog exposed by `run_probe(...)`
+in `probes/probe.py`, the formula vs actual comparison, and the meaning of
+invariant checks such as monotonicity and D2H≥H2D. ADR-0010 (CLI surface)
+enumerates the `kernbench probe` subcommand, but **what the probe actually
+measures** and **which invariants it judges as PASS/FAIL** were not
+recorded at the ADR level.
+
+## First action
+
+The first four actions of `run_probe(topology_path, case_filter=None)`:
+
+1. Compute the absolute path with
+   `Path(topology_path).expanduser().resolve()`.
+2. `load_topology(path)` → a `TopologyGraph` instance (graph + spec).
+3. `_build_edge_map(graph)` → a `{(src, dst): Edge}` fast lookup table.
+4. Instantiate `AddressResolver(graph)` and `PathRouter(graph)`.
+
+It then sets `nbytes = 32768` (= 32 KiB, the reference data size of the
+summary table) and `show_all = (case_filter is None or case_filter ==
+"all")`.
+
+In short, **the probe's first action is to load the topology once, prepare
+the edge map / resolver / router, and fix 32 KiB as the standard
+measurement size**. After that, the cases of the three categories H2D →
+D2H → PE DMA each run in a separate `GraphEngine` instance (blocking
+cross-talk between cases).
+
+## Context
+
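+`kernbench probe` is a verification tool introduced with the following
+intents:
+
+- **Ground truth for manual analysis**: when the latency of a real
+  simulation (`kernbench run --bench ...`) looks abnormal, obtain the
+  answer for a simple traffic pattern separately and compare.
+- **Formula vs actual comparison**: check whether the analytical model
+  (wire latency + overhead + drain) agrees with the simulation result
+  (`total_ns`). A mismatch hints at which model-simplification assumption
+  (ADR-0033) is missing something.
+- **Monotonicity check**: automatically confirm the invariant that latency
+  must increase monotonically as the hop count grows.
+- **Utilization sweep**: a table of BW utilization per data size (4 KiB to
+  1 MiB).
+
+Without an ADR-level specification of this tool's behavior:
+
+- Anyone adding another kind of traffic pattern (for example MCpuDma or
+  IPCQ) would struggle to follow the table format and measurement units
+  of the existing categories consistently.
+- It is ambiguous what monotonicity is checked against (hop count? cube
+  distance? wire length?).
+- The meaning of the 32 KiB reference size and of the `[4 KiB, 16 KiB,
+  64 KiB, 256 KiB, 1 MiB]` sweep can only be confirmed by grepping the
+  code.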
+
+## Decision
+
+### D1. Three case categories: H2D / D2H / PE DMA
+
+Each category has a distinct data path in the topology and is printed as
+its own summary table + sweep table + route detail block.
+
+- **H2D (Host→Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
+  dst_pe=0, pattern="zero")` flows along the `pcie_ep → io_cpu → m_cpu →
+  hbm_ctrl` path. The hop count increases with the cube index:
+  - h2d-1hop: cube=0, hops=1
+  - h2d-2hop: cube=4, hops=2
+  - h2d-3hop: cube=8, hops=3
+  - h2d-4hop: cube=12, hops=4
+- **D2H (Device→Host Read)**: `MemoryReadMsg(src_sip=0, src_cube, src_pe=0)`.
+  The latency is the sum of the forward command path and the reverse data
+  path. It uses the same four hop cases.
+- **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe, dst_pa)`.
+  Five cases vary the cube/PE location:
+  - pe-local-hbm: same cube, same PE
+  - pe-same-half-hbm: same cube, different PE (PE 1)
+  - pe-cross-half-hbm: same cube, far PE (PE 4)
+  - pe-cross-cube-hbm-best: adjacent cube (cube 1)
+  - pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
+
+The indices used above (cubes 0/4/8/12 for H2D; cubes 1 and 15 and PE 4
+for PE DMA) are meaningful because of how distance is defined on the 4x4
+cube mesh (sip.cube_mesh.w=4, h=4); if the cube_mesh size changes later,
+these values must be updated along with it.
+
+### D2. Standard measurement size: `nbytes = 32768` (32 KiB)
+
+The summary table of every case shows the result of a single run with
+`nbytes=32768`. Why 32 KiB was chosen:
+
+- It is a moderate size at which neither DMA overhead nor BW drain
+  dominates.
+- It is comparable to the single-transfer unit of many sub-units (TCM,
+  register file).
+
+How utilization changes with size is shown by a separate sweep table (D3).
+
+### D3. Utilization sweep — `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
+
+`SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`, `SWEEP_LABELS =
+["4KB", "16KB", "64KB", "256KB", "1MB"]`. For each size the following
+formulas apply:
+
+```
+drain   = nbytes / bottleneck_bw
+total   = overhead + wire + drain
+eff_bw  = nbytes / total
+util%   = eff_bw / bottleneck_bw × 100
+```
+
+If `bn_bw is None or <= 0`, that column prints 0.0 %. The point of the
+table: as the hop count grows, small transfers become overhead-bound and
+large transfers become drain-bound, and one table shows that pattern.
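The sweep formulas can be written out as a short function. This is a sketch under stated assumptions: the function name `utilization_pct` is hypothetical, bandwidth is taken in GB/s with 1 GB/s treated as 1 byte/ns, and the overhead and wire figures in the usage lines are illustrative inputs rather than measured values.

```python
SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]


def utilization_pct(nbytes, overhead_ns, wire_ns, bn_bw_gbs):
    """Formula-side Util% for one transfer size."""
    if bn_bw_gbs is None or bn_bw_gbs <= 0:
        return 0.0                        # no usable bottleneck BW
    drain_ns = nbytes / bn_bw_gbs         # assumes 1 GB/s == 1 byte/ns
    total_ns = overhead_ns + wire_ns + drain_ns
    eff_bw = nbytes / total_ns
    return eff_bw / bn_bw_gbs * 100.0


# Illustrative inputs only: fixed 64 ns overhead, no wire delay, 64 GB/s.
row = [utilization_pct(n, 64.0, 0.0, 64.0) for n in SWEEP_SIZES]
```

Algebraically `util%` reduces to `drain / total × 100`, so with a fixed overhead the utilization rises toward 100% as the transfer grows, which is the overhead-bound to drain-bound transition the table is meant to expose.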
+
+### D4. Measured items: actual / formula / breakdown
+
+Columns shown in each case row:
+
+- `Actual` (total_ns): `trace["total_ns"]` from the SimPy run.
+- `Ovhd`: the sum of `node.attrs["overhead_ns"]` over all nodes on the
+  path (formula breakdown).
+- `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
+- `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
+- `Ovhd%` / `Drain%`: the share of Actual taken by Ovhd and by Drain (the
+  formula's Wire term is usually very small and is not shown).
+- `Eff.BW`: `nbytes / total_ns` (the actually measured BW).
+- `BN.BW`: bottleneck bandwidth (formula), the minimum BW over all edges
+  on the path. Prints "-" if no edge has a BW.
+- `Util%`: `Eff.BW / BN.BW × 100`. 100% means the single-stream BW upper
+  bound has been reached.
+
+A large gap between the formula sum (`wire + ovhd + drain`) and actual
+signals a factor the model simplification fails to capture (a prompt to
+review the assumptions of ADR-0033).
+
+### D5. Automatic invariant checks: PASS/FAIL
+
+The following invariants are checked automatically and printed as
+`[v] PASS` / `[x] FAIL`:
+
+- **H2D / D2H monotonic increase**: actual latency must increase strictly
+  as the hop count grows. `all(lats[i] < lats[i+1] for ...)`.
+- **D2H ≥ H2D**: at the same hop index, D2H ≥ H2D (because D2H has two
+  legs, forward command + reverse data). `all(d2h[i].total >=
+  h2d[i].total)`.
+- **PE DMA best < worst**: cross-cube best (adjacent) latency < cross-cube
+  worst (diagonal) latency.
+- **PE DMA local vs remote**: prints a comparison of local BN BW vs remote
+  BN BW (informational, not PASS/FAIL).
+
+When a check FAILs, it is printed clearly on a single line so that a
+person notices the model/topology regression immediately.
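The two list-based invariants can be sketched as plain predicates. The helper names below are hypothetical and the latency lists are made-up illustrative numbers, not probe output; only the comparison logic and the `[v]` / `[x]` prefixes come from the text above.

```python
def strictly_increasing(lats):
    """Monotonicity: each hop count must be strictly slower than the last."""
    return all(a < b for a, b in zip(lats, lats[1:]))


def d2h_not_faster(h2d, d2h):
    """D2H >= H2D at every hop index (D2H has command + data legs)."""
    return all(d >= h for h, d in zip(h2d, d2h))


def verdict(ok):
    return "[v] PASS" if ok else "[x] FAIL"


# Illustrative latencies in ns for the 1..4-hop cases.
h2d = [100.0, 140.0, 180.0, 220.0]
d2h = [150.0, 200.0, 250.0, 300.0]

lines = [
    f"{verdict(strictly_increasing(h2d))} H2D monotonic increase",
    f"{verdict(strictly_increasing(d2h))} D2H monotonic increase",
    f"{verdict(d2h_not_faster(h2d, d2h))} D2H >= H2D",
]
```

With a single data point `zip(lats, lats[1:])` is empty and `all(...)` is vacuously true, so a caller that wants to skip the check for filtered runs has to test the list length itself.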
+
+### D6. Route detail — per-hop timestamp trace
+
+After the summary and sweep tables, the path of each case and its per-hop
+cumulative time (`_hop_timestamps`) are printed in a separate section:
+
+- H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
+  (`m_cpu → hbm_ctrl`) + per-hop trace.
+- D2H: forward (cmd, no data) and reverse (data) traces shown separately.
+- PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
+
+The timestamp of each hop is the running sum of `wire_ns + overhead_ns`.
+The annotation of the terminal hop carries `drain:Xns`. The bottleneck
+edge is marked `<BN:XXGB/s>` so that it can be identified visually.
+
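+### D7. Meaning of the case_filter argument
+
+- `None` or `"all"`: run all cases (default).
+- Any other string: run only the case whose name matches exactly, e.g.
+  `kernbench probe --case h2d-2hop`.
+
+Within each category a case is skipped when `name != case_filter`, and the
+category's monotonicity / D2H≥H2D comparison is naturally skipped when
+there is only one data point.
+
+The CLI parser's `--case` defaults to `"all"`, so omitting the argument
+runs everything.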
+
+### D8. A fresh GraphEngine per case
+
+Each of the 4 H2D, 4 D2H, and 5 PE DMA cases runs in a **new GraphEngine**
+instance (`engine = GraphEngine(graph)`). Reasons:
+
+- Isolate accumulated state (op_log, completion tracking, allocators, and
+  so on) so that cases do not cross-talk.
+- Guarantee that the traffic of one case does not affect the BW
+  measurement of another.
+
+This isolation lets probe results be interpreted as the latency of **each
+case as a standalone single flow**. Measuring multi-flow contention is the
+job of separate tools (for example the `pe2pe_overview` plot, or the
+multi-flow merging model of ADR-0033).
+
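+### D9. Stability of the output format
+
+The probe's stdout is meant for humans; exact column widths, separators,
+and whitespace are not a machine-readable contract. For automated tools to
+parse probe results, a separate JSON output mode would have to be added
+(not implemented yet).
+
+The `[v]` / `[x]` prefixes of the PASS/FAIL lines are guaranteed stable as
+anchors for CI grep.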
+
+## Alternatives Considered
+
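+### A1. Register the probe as a separate bench (`@bench(name="probe")`)
+
+Rejected. The probe is intended as a verification tool, not a bench: its
+essence is multi-engine execution for sweeps/analysis and invariant
+PASS/FAIL output, which does not fit the "single device + single
+RuntimeContext" bench model of ADR-0045.
+
+### A2. Exit code 1 on a monotonicity violation
+
+Rejected (for now). The tool is intended mainly for human inspection, so it
+prints PASS/FAIL lines and exits 0. If CI should fail on a violation, a
+separate wrapper can decide from the result of `grep "\[x\]"`. A
+strict-mode flag (`--strict`) can be introduced as a follow-up.
+
+### A3. Move the probe's case definitions to external YAML
+
+Rejected (for now). The 13 cases (4 H2D + 4 D2H + 5 PE DMA) are hardcoded
+and their meaning is tightly bound to the topology's mesh structure.
+Moving them to external YAML would require documenting the meaning of the
+indices (4, 8, 12 / 1, 4, 15) separately, which loses cohesion. If cases
+start being added frequently, introduce it then with a separate ADR.
+
+### A4. Add multi-flow contention measurement
+
+Rejected (out of the probe's scope). The single-flow isolation model stated
+in D8 is the core intent of the probe. Multi-flow contention belongs to
+another area of the ADR-0033 latency model and should be handled by a
+separate tool or a separate case category.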
+
+## Consequences
+
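+- The probe's case catalog (D1) and measurement units (D2/D3) are explicit
+  at the ADR level, so it is clear which table format to follow when
+  adding a new traffic category.
+- The column semantics of formula vs actual (D4) are fixed, so questions
+  such as "why is Drain% 5% / 70%?" can be connected quickly to a review of
+  the ADR-0033 assumptions.
+- The automatic invariant checks (D5) are fixed in the ADR, establishing a
+  safety net: the probe catches monotonicity / D2H≥H2D regressions
+  immediately when the latency model changes.
+- The isolation between cases in D8 is explicit, so probe results can be
+  interpreted safely as single-flow measurements. It is clear that a
+  separate tool track is needed if multi-flow measurement is required.
+- The strict-mode flag of A2 is recorded as a follow-up candidate, so it
+  can be introduced with minimal extra work when CI integration is
+  requested.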
@@ -0,0 +1,308 @@
+# ADR-0050: CCL Algorithm Module Contract — `ccl/algorithms/*.py`
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies the interface a module in the
+`src/kernbench/ccl/algorithms/` directory must provide to be used as a
+collective algorithm by the AHBM CCL backend (ADR-0047), the kernel
+signature, and the procedure for adding a new algorithm. ADR-0047 D3
+mentions only in one line that "an algorithm module must expose `kernel`,
+`kernel_args`, and optionally `TOPO_NAME_TO_KIND`", but **the contract an
+algorithm module author must follow** has never been written up at the ADR
+level. It pairs with ADR-0045, which covers the bench module contract.
+
+## First action
+
+An algorithm module is imported at two points:
+
+1. **AHBM backend entry**: when user code calls
+   `dist.init_process_group(backend="ahbm")`, `self._algo_module =
+   importlib.import_module(self._merged["module"])` runs inside
+   `AhbmCCLBackend.__init__`. What happens at module level at that point,
+   in order:
+   - Integer constants such as `SIP_TOPO_RING/TORUS/MESH` are exposed in
+     the module namespace.
+   - The `TOPO_NAME_TO_KIND` dict is exposed in the module namespace; the
+     backend looks it up with `topo_map = getattr(self._algo_module,
+     "TOPO_NAME_TO_KIND", None)`.
+   - The `kernel_args` function is defined, to be used by the caller at
+     call time.
+   - Algorithm functions such as `allreduce_intercube_multidevice` are
+     defined.
+   - On the last line of the module, `kernel =
+     allreduce_intercube_multidevice` exposes the alias.
+
+2. **ccl.yaml install step**: the same algorithm module is imported when
+   `kernbench.ccl.install.install_ipcq` is called to push the IPCQ neighbor
+   table.
+
+In short, **the first action of an algorithm module is to expose the
+topology-kind constants, the `TOPO_NAME_TO_KIND` dict, the `kernel_args`
+function, and the `kernel` alias in its module namespace**. Import-time
+side effects are sufficient for all of this; no separate initialization
+function call is needed.
+
+## Context
+
+At process group initialization, `AhbmCCLBackend` (ADR-0047) obtains the
+module path from `defaults.algorithm` in `ccl.yaml` (or from the algorithm
+name the user specified) and imports it dynamically. The backend expects
+the following four things from that module:
+
+- `kernel`: the entry function of the collective.
+- `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: the bundle
+  of positional arguments passed to the kernel.
+- `TOPO_NAME_TO_KIND` (optional): a dict mapping the `sips.topology` string
+  of `topology.yaml` (e.g. `"ring_1d"`, `"torus_2d"`, `"mesh_2d_no_wrap"`)
+  to an integer constant.
+- (indirect) IPCQ neighbor table install: `configure_sfr_intercube_multisip`
+  decides the SFR from the algorithm module's `TOPO_NAME_TO_KIND` and
+  `cube_w/h`.
+
+The only algorithm module in the current corpus is
+`lrab_hierarchical_allreduce.py` (248 lines). The name stands for
+"**l**eft-**r**ight **a**lternating **b**roadcast **hierarchical
+allreduce**". Every module added in the future, such as `ring_allreduce`,
+`tree_allreduce`, or `broadcast`, must follow this contract for dispatch
+to stay consistent.
+
+Without this contract at the ADR level:
+
+- A new algorithm author has to infer the signature from the single line
+  in ADR-0047 D3.
+- The meaning of the kernel function's argument order (notably `t_ptr,
+  n_elem, cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w,
+  sip_topo_h, tl`) is not clear without grepping the code.
+- Which arguments `kernel_args` takes and which tuple it must return is
+  fixed only by convention.
+
+## Decision
+
+### D1. An algorithm module exposes four public symbols
+
+```python
+# src/kernbench/ccl/algorithms/<name>.py
+
+from __future__ import annotations
+
+# (required) topology-kind constants, used inside the algorithm
+SIP_TOPO_RING = 0
+SIP_TOPO_TORUS = 1
+SIP_TOPO_MESH = 2
+
+# (optional) topology name -> kind mapping. The backend uses it to convert
+# the string SIP topology from ccl.yaml/topology into an integer.
+TOPO_NAME_TO_KIND = {
+    "ring_1d": SIP_TOPO_RING,
+    "torus_2d": SIP_TOPO_TORUS,
+    "mesh_2d_no_wrap": SIP_TOPO_MESH,
+}
+
+# (required) kernel argument builder
+def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
+    return (n_elem, cube_w, cube_h, world_size)
+
+# (required) kernel function (TLContext is injected via the `tl=...` keyword)
+def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
+                         sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
+    ...
+
+# (required) kernel alias; the backend accesses it as `module.kernel`
+kernel = my_allreduce_kernel
+```
+
+- The `kernel` alias is the entry point the backend calls directly.
+  Whatever the function is named (e.g. `allreduce_intercube_multidevice`),
+  it must be exposed as `module.kernel = fn`.
+- Without `kernel_args`, the backend has no way to build the algorithm's
+  arguments. See D2 for the signature.
+- Without `TOPO_NAME_TO_KIND`, the backend falls back to `sip_topo_kind =
+  0`. It may be omitted by an algorithm that supports a single topology.
+
+### D2. `kernel_args` signature: `(world_size, n_elem, *, cube_w, cube_h)`
+
+```python
+def kernel_args(world_size: int, n_elem: int, *,
+                cube_w: int = 4, cube_h: int = 4) -> tuple:
+    return (n_elem, cube_w, cube_h, world_size)
+```
+
+- **Positional arguments**: `world_size` (= number of ranks), `n_elem` (=
+  element count of a single shard, assuming f16).
+- **Keyword arguments**: `cube_w`, `cube_h` (= cube mesh size). The default
+  is 4×4, matching the `sip.cube_mesh` default in `topology.yaml`.
+- **Return value**: a tuple ordered like the kernel's positional
+  arguments.
+
+When the backend's `all_reduce` makes the call:
+
+```python
+kernel_args_tuple = self._algo_module.kernel_args(
+    self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
+)
+extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
+pending = self.ctx.launch(
+    self._merged["algorithm"], kernel_fn, tensor,
+    *kernel_args_tuple, *extra_args, _defer_wait=True,
+)
+```
+
+So the kernel's final positional arguments are `(tensor_ptr,
+*kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`,
+with `tl=...` injected automatically as a keyword on top. The length and
+order of the tuple returned by `kernel_args` must **match the kernel
+signature 1:1**.
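The argument assembly described above can be traced end to end with stand-ins. In this sketch `fake_launch` and `recording_kernel` are hypothetical helpers that only mimic how positionals are forwarded and how `tl` is injected; they are not the real `ctx.launch` or a real collective, and the numeric values are arbitrary.

```python
def kernel_args(world_size, n_elem, *, cube_w=4, cube_h=4):
    """Argument builder with the D2 signature."""
    return (n_elem, cube_w, cube_h, world_size)


def recording_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                     sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    """Stand-in kernel: reports what it received instead of issuing tl ops."""
    return {
        "t_ptr": t_ptr, "n_elem": n_elem,
        "cube_w": cube_w, "cube_h": cube_h, "n_sips": n_sips,
        "sip_rank": sip_rank, "sip_topo_kind": sip_topo_kind,
        "sip_topo_w": sip_topo_w, "sip_topo_h": sip_topo_h,
        "has_tl": tl is not None,
    }


def fake_launch(kernel_fn, tensor_ptr, *args, tl):
    """Hypothetical stand-in for launch: forwards positionals, injects tl."""
    return kernel_fn(tensor_ptr, *args, tl=tl)


kernel_args_tuple = kernel_args(4, 1024, cube_w=1, cube_h=1)
extra_args = (2, 0, 0, 0)  # (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
seen = fake_launch(recording_kernel, 0x1_0000_0000,
                   *kernel_args_tuple, *extra_args, tl=object())
```

If `kernel_args` returned a tuple of the wrong length, the call inside `fake_launch` would fail with a `TypeError`, which is the kind of mismatch the 1:1 rule is meant to prevent.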
+
+### D3. `kernel` function signature: the standardized 9 + tl arguments
+
+Recommended signature:
+
+```python
+def my_kernel(
+    t_ptr: int,             # VA base of the row-wise-sharded tensor on this SIP
+    n_elem: int,            # element count per cube tile (or per shard)
+    cube_w: int,            # cube mesh width  (from kernel_args)
+    cube_h: int,            # cube mesh height (from kernel_args)
+    n_sips: int,            # same as world_size (rank = SIP, ADR-0024)
+    sip_rank: int,          # rank of this SIP
+    sip_topo_kind: int,     # result of the TOPO_NAME_TO_KIND lookup
+    sip_topo_w: int,        # SIP mesh width (0 for ring_1d)
+    sip_topo_h: int,        # SIP mesh height (0 for ring_1d)
+    *, tl,                  # TLContext (auto-injected)
+) -> None:
+    ...
+```
+
+Even if `kernel_args` adopts a different positional order, the **last four
+positional arguments of the kernel are always `(sip_rank, sip_topo_kind,
+sip_topo_w, sip_topo_h)`**, appended by the backend as `extra_args`
+(ADR-0047 D5). User-defined algorithms must accept these four arguments
+too, but a single-SIP algorithm can simply ignore them.
+
+`tl` is injected as a keyword, not a positional argument:
+`RuntimeContext.launch` adds `tl=tl_ctx` right before calling the kernel.
+The `tl` in the kernel signature must therefore be keyword-only (`*, tl`)
+or the last keyword parameter.
+
+### D4. Freedom and constraints of the kernel body
+
+Surface available inside the kernel body: every `tl.*` primitive of
+ADR-0046 D3.
+
+Frequently used patterns:
+
+- `cube_id = tl.program_id(axis=1)`: the index of the cube this PE belongs
+  to.
+- `pe_addr = t_ptr + cube_id * nbytes`: computes the VA of the per-cube
+  tile.
+- `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")`: loads the local
+  data.
+- `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)`: IPCQ
+  collectives.
+- `acc = acc + recv`: TensorHandle arithmetic operators (ADR-0046 D4).
+- `tl.store(pe_addr, acc)`: stores the result.
+
+The kernel body is an ordinary Python function, free to branch and loop,
+except that:
+
+- SimPy `yield` and `async` are forbidden (ADR-0046 D1).
+- Direct access to a TensorHandle's `.data` is forbidden: the phase 1
+  timing model knows nothing about data dependence (the 2-pass split of
+  ADR-0020).
+- Kernel execution must be deterministic: running twice with the same
+  input must emit the same op sequence. No randomness or external IO.
+
+### D5. SIP topology semantics: meaning of `sip_topo_kind`
+
+The backend looks up the `system.sips.topology` string of `topology.yaml`
+in the algorithm module's `TOPO_NAME_TO_KIND` and converts it to the
+integer `sip_topo_kind`. The algorithm branches on this integer:
+
+```python
+if sip_topo_kind == SIP_TOPO_RING:
+    acc = _inter_sip_ring(...)
+elif sip_topo_kind == SIP_TOPO_TORUS:
+    acc = _inter_sip_torus_2d(...)
+elif sip_topo_kind == SIP_TOPO_MESH:
+    acc = _inter_sip_mesh_2d(...)
+```
+
+Each topology branch communicates with its peers through IPCQ direction
+names (e.g. `"global_E"`, `"W"`, `"S"`, `"N"`). The meaning of the
+directions is defined by ADR-0023/0025, and
+`configure_sfr_intercube_multisip` installs the IPCQ neighbor table
+accordingly.
+
+When an unsupported topology kind comes in, an algorithm module is advised
+to `raise ValueError(f"unsupported topology kind {sip_topo_kind}")`
+explicitly rather than act as a silent no-op, so that an accidental
+misdispatch from the backend fails fast.
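One way to get the fail-loud behavior recommended above is a lookup table with an explicit error path. This is a sketch only: the three handler functions are placeholders that tag their input, and the table-driven dispatch is an alternative spelling of the `if`/`elif` chain shown earlier, not the existing module's code.

```python
SIP_TOPO_RING = 0
SIP_TOPO_TORUS = 1
SIP_TOPO_MESH = 2


def _inter_sip_ring(acc):
    return ("ring", acc)       # placeholder for the ring exchange


def _inter_sip_torus_2d(acc):
    return ("torus_2d", acc)   # placeholder for the torus exchange


def _inter_sip_mesh_2d(acc):
    return ("mesh_2d", acc)    # placeholder for the mesh exchange


_HANDLERS = {
    SIP_TOPO_RING: _inter_sip_ring,
    SIP_TOPO_TORUS: _inter_sip_torus_2d,
    SIP_TOPO_MESH: _inter_sip_mesh_2d,
}


def inter_sip_phase(sip_topo_kind, acc):
    """Dispatch on the topology kind; unknown kinds raise instead of no-op."""
    handler = _HANDLERS.get(sip_topo_kind)
    if handler is None:
        raise ValueError(f"unsupported topology kind {sip_topo_kind}")
    return handler(acc)
```

An algorithm that supports only one topology can keep a single-entry table and still get the same error for everything else.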
+
+### D6. Structure of the algorithm entry in ccl.yaml
+
+An algorithm module pairs with an entry in `ccl.yaml` (ADR-0023 D10 +
+ADR-0047 D3):
+
+```yaml
+defaults:
+  algorithm: lrab_hierarchical_allreduce
+  n_elem: 8
+
+algorithms:
+  lrab_hierarchical_allreduce:
+    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
+    # optional: world_size override
+    # optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
+```
+
+- `module`: the full Python module path. The backend's
+  `importlib.import_module` uses this string as-is.
+- `world_size` (optional): if specified, overrides the topology fallback
+  (ADR-0047 D2).
+- Algorithm-specific parameters are consumed by
+  `configure_sfr_intercube_multisip`.
+
+To add a new algorithm:
+
+1. Write `src/kernbench/ccl/algorithms/<name>.py` (D1 conventions).
+2. Add an entry to the `algorithms` section of `ccl.yaml`.
+3. (If needed) add an SFR install branch to `kernbench.ccl.sfr_config`.
+4. Add tests (e.g. `tests/sccl/test_<name>.py`, extending the eval harness
+   of ADR-0043).
+
+### D7. Legacy "rank = flat PE index" mode
+
+The `world_size` override path of `ccl.yaml` specified in ADR-0047 D2 is
+used by the legacy "rank = flat PE index" tests. In this mode too, the
+algorithm module may simply assume that `n_sips=world_size` ranks come in;
+the backend separates the rank↔(SIP, cube, PE) mapping beforehand, so the
+algorithm body needs no mode-specific branching.
+
+For single-cube workloads, however, `cube_w=cube_h=1` comes in, and the
+module must be written so that the mesh-based phases are skipped (see the
+`single_cube = (cube_w == 1 and cube_h == 1)` pattern in
+`lrab_hierarchical_allreduce.py`).
+
+## Alternatives Considered
+
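+### A1. Structure the algorithm module as a class (`class Allreduce: kernel(...)`, etc.)
+
+Rejected. The Python module namespace itself already serves as the
+algorithm's identity (the `importlib.import_module` of ADR-0047 D3); a
+class layer would only add indirection and thicken the dispatch-side
+code. The module-level free function + `kernel` alias pattern is clear
+enough.
+
+### A2. Type `kernel_args` as an explicit dataclass
+
+Rejected (for now). It is normal for the argument count to differ per
+algorithm, and forcing one kind of dataclass would make compatibility
+across diverse algorithms harder. Returning a tuple is simple and meshes
+cleanly with the backend-side `*kernel_args_tuple` unpacking. If an
+algorithm needs stronger typing of its own, it is free to use a NamedTuple
+inside its module.
+
+### A3. Move SFR install into the algorithm module
+
+Rejected. SFR install (`configure_sfr_intercube_multisip`) is a
+cross-module decision that looks at both the topology and the algorithm to
+install the IPCQ neighbor table, so a dedicated location such as
+`kernbench.ccl.sfr_config` is more natural than the inside of an algorithm
+module. The "add an sfr_config branch if needed" workflow of D6 is clearer
+in terms of separation of responsibilities.
+
+### A4. Auto-register the algorithm name in the module namespace (decorator)
+
+Rejected. Unlike ADR-0045 (bench), an algorithm is tied directly to its
+ccl.yaml entry, so an additional registry would be redundant. A single
+`module` string mapping is enough.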
+
+## Consequences
+
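+- The one-line contract of ADR-0047 D3 is expanded into the author-friendly
+  guide of D1–D7, so adding a new algorithm no longer requires inferring
+  the signature by grep.
+- The 9 + tl argument signature of D3 is standardized and meshes naturally
+  with the backend's `extra_args` append (ADR-0047 D5). It is explicit
+  that future single-SIP-only algorithms must also accept the four sip_*
+  arguments.
+- With the fail-loud recommendation of D5, if the topology in ccl.yaml is
+  mistakenly set to one the algorithm does not support, the backend fails
+  with a ValueError instead of a silent wrong result.
+- The step-by-step procedure of D6 is explicit, so it is clear how far a
+  new algorithm has to touch sfr_config / tests / ccl.yaml.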
@@ -0,0 +1,267 @@
+# ADR-0051: Routing Helper API — `AddressResolver` + `PathRouter`
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies every public API, argument, and return value of the two
+helper classes (`AddressResolver`, `PathRouter`) exposed by
+`policy/routing/router.py`, as well as where each of the four different
+adjacency graphs is used. ADR-0002 defines routing distance, ordering, and
+bypass rules, but **the helper API surface itself** has never been written
+up at the ADR level.
+
+## First action
+
+### `AddressResolver(graph)`
+
+On construction it caches the following two things:
+
+1. `self._node_ids = set(graph.nodes)`: the set of all node ids (for
+   lookup).
+2. `self._hbm_slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube`:
+   derived from `graph.spec.cube.memory_map` (default `48 GB / 8 slices =
+   6 GB`). `resolve()` uses this value to recover `pe_id` from the
+   `hbm_offset` of an HBM PA.
+
+In short, **the first action of AddressResolver is to precompute the set of
+all node ids and the HBM slice size**. It does not hold on to the graph
+itself.
+
+### `PathRouter(graph)`
+
+On construction it **builds four separate adjacency graphs at once**:
+
+1. `self._adj_all`: includes every edge (for component-to-component
+   routing).
+2. `self._adj`: only edges with `kind != "command"` (PE DMA / general data
+   path).
+3. `self._adj_mcpu_dma`: excludes `_MCPU_DMA_EXCLUDE = {"pe_internal",
+   "pe_to_router"}` (so that M_CPU DMA is not misrouted through PE pipeline
+   nodes).
+4. `self._adj_local`: excludes the 8 `_UCIE_KINDS` (for cube-local routing;
+   this prevents Dijkstra from preferring UCIe over the mesh because UCIe
+   looks like a zero-distance bus).
+
+Each graph is a `defaultdict(list)` of `(neighbor, weight)` pairs, with
+`edge.routing_weight_mm or edge.distance_mm` used as the weight.
+
+In short, **the first action of PathRouter is to classify the topology
+edges under four different policies at once and build four adjacency
+lists**. Each `find_*()` call picks the appropriate graph and runs
+Dijkstra on it.
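The effect of keeping separate adjacency lists can be seen on a toy graph. The sketch below follows the exclusion rules as literally described above; the edge tuples, the node names, and the single-entry `UCIE_KINDS` set are hypothetical stand-ins (the real `_UCIE_KINDS` has eight kinds that are not listed in this ADR), and `shortest_path` is a plain textbook Dijkstra rather than the router's own implementation.

```python
import heapq
from collections import defaultdict

MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}
UCIE_KINDS = {"ucie"}  # hypothetical stand-in for the real 8 kinds


def build_adjacencies(edges):
    """edges: iterable of (src, dst, kind, weight). Returns four adj lists."""
    adj_all, adj_data, adj_mcpu, adj_local = (
        defaultdict(list) for _ in range(4))
    for src, dst, kind, weight in edges:
        adj_all[src].append((dst, weight))
        if kind != "command":
            adj_data[src].append((dst, weight))
        if kind not in MCPU_DMA_EXCLUDE:
            adj_mcpu[src].append((dst, weight))
        if kind not in UCIE_KINDS:
            adj_local[src].append((dst, weight))
    return adj_all, adj_data, adj_mcpu, adj_local


def shortest_path(adj, src, dst):
    """Dijkstra over one adjacency list; returns a node list or None."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, w in adj.get(node, ()):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


edges = [("a", "b", "mesh", 2.0), ("b", "c", "mesh", 2.0),
         ("a", "c", "ucie", 0.0), ("a", "d", "command", 1.0)]
adj_all, adj_data, adj_mcpu, adj_local = build_adjacencies(edges)
```

On this toy graph the zero-weight `ucie` edge wins in `adj_all`, while `adj_local` is forced through the mesh hops, which mirrors the stated reason for excluding UCIe from cube-local routing.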
+
+## Context
+
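+`policy/routing/router.py` carries the following two responsibilities
+together:
+
+- **Name mapping**: the single owner of the topology naming rules
+  (`sip{S}.cube{C}.<comp>`, `sip{S}.io{I}.pcie_ep`, and so on). Components,
+  the probe, IPCQ install, and the runtime API call the helper instead of
+  building name strings themselves.
+- **Path decision**: policy separation according to edge `kind`. Even for
+  the same src→dst, a different adjacency must be used depending on the
+  routing intent (PE DMA vs M_CPU DMA vs general component routing), and
+  the result differs accordingly.
+
+Although this helper API is consumed widely throughout the code (probe.py
+/ distributed.py / install.py / various components / tests), **the exact
+signatures, return semantics, and which adjacency is used** are not
+collected in one place at the ADR level. This ADR fills that gap.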
+
+## Decision
+
+### D1. The 5 public APIs of `AddressResolver`
+
+#### D1.1. `resolve(addr: PhysAddr) -> str`
+
+Converts a `PhysAddr` instance into a destination node id in the topology.
+
+```
+addr.kind == "hbm"             → f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
+  where pe_id = addr.hbm_offset // self._hbm_slice_bytes  (ADR-0017 D4/D9)
+
+addr.kind == "pe_resource":
+  addr.unit_type == PE         → f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
+  addr.unit_type == SRAM       → f"sip{s}.cube{d}.sram"
+  addr.unit_type == MCPU       → f"sip{s}.cube{d}.m_cpu"
+  otherwise                    → RoutingError("unsupported unit_type")
+
+any other kind                 → RoutingError("unsupported address kind")
+```
+
+If the computed node id is not in `self._node_ids`, the call raises
+`RoutingError(f"node {node_id} not found in topology")`. That is, even a
+syntactically valid address fails loudly when no topology node actually
+maps to it.
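
The HBM branch of `resolve()` can be sketched as follows. This is a minimal illustration, assuming a simplified `HbmAddr` stand-in for `PhysAddr` and a plain `LookupError` in place of `RoutingError`; both names are hypothetical and not taken from router.py.

```python
from dataclasses import dataclass

GIB = 1 << 30


def hbm_slice_bytes(hbm_total_gb: int = 48, slices_per_cube: int = 8) -> int:
    """Per-PE HBM slice size, using the default 48 GB / 8 slices from the ADR."""
    return hbm_total_gb * GIB // slices_per_cube


@dataclass(frozen=True)
class HbmAddr:  # hypothetical stand-in for PhysAddr (kind == "hbm")
    sip: int
    cube: int
    hbm_offset: int


def resolve_hbm(addr: HbmAddr, node_ids: set, slice_bytes: int) -> str:
    # Recover the owning PE from the byte offset inside the cube's HBM.
    pe_id = addr.hbm_offset // slice_bytes
    node_id = f"sip{addr.sip}.cube{addr.cube}.hbm_ctrl.pe{pe_id}"
    # Fail loudly when the address is well-formed but maps to no node.
    if node_id not in node_ids:
        raise LookupError(f"node {node_id} not found in topology")
    return node_id


slice_bytes = hbm_slice_bytes()
nodes = {f"sip0.cube0.hbm_ctrl.pe{i}" for i in range(8)}
node = resolve_hbm(HbmAddr(sip=0, cube=0, hbm_offset=7 * GIB), nodes, slice_bytes)
```

An offset of 7 GiB falls in the second 6 GiB slice, so it resolves to the `pe1` controller.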
+
+#### D1.2. `find_m_cpu(sip, cube) -> str`
+
+Returns `f"sip{sip}.cube{cube}.m_cpu"`. Raises `RoutingError` if the node
+does not exist.
+
+#### D1.3. `find_pcie_ep(sip, io_id="io0") -> str`
+
+Returns `f"sip{sip}.{io_id}.pcie_ep"`. Raises `RoutingError` if the node
+does not exist.
+
+#### D1.4. `find_io_cpu(sip, io_id="io0") -> str`
+
+Returns `f"sip{sip}.{io_id}.io_cpu"`. Raises `RoutingError` if the node
+does not exist.
+
+#### D1.5. `find_all_pcie_eps() -> list[str]`
+
+Returns the PCIE_EP node ids of all SIPs as a sorted list, filtered with
+`endswith(".pcie_ep")`. Used when cross-SIP IPCQ enumerates every PCIE_EP.
+
+This class is the single owner of the naming convention
+(`sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`) (ADR-0015 D4). The
+topology builder creates nodes under the same convention, and components
+never build name strings directly; they always go through the helpers.
+
+### D2. The 4 adjacency graphs of `PathRouter`
+
+The constructor builds all of them at once. The edge `kind` determines the
+policy:
+
+| graph             | excluded edge kinds                           | purpose                                    |
+|-------------------|-----------------------------------------------|--------------------------------------------|
+| `_adj_all`        | (none)                                        | IO_CPU/M_CPU routing, including M_CPU↔NOC command edges |
+| `_adj`            | `"command"`                                   | PE DMA / general data path                 |
+| `_adj_mcpu_dma`   | `"pe_internal"`, `"pe_to_router"`             | M_CPU DMA (bypasses the PE pipeline)       |
+| `_adj_local`      | `_UCIE_KINDS` (`ucie_internal`, `ucie_conn_to_router`, `router_to_ucie_conn`, `ucie_conn_to_noc`, `noc_to_ucie_conn`, `ucie_mesh`, `io_to_cube`, `cube_to_io`) | same-cube routing (bypasses the UCIe bus) |
+
+Each graph is a `dict[node_id, list[(neighbor, weight)]]`, with
+`edge.routing_weight_mm or edge.distance_mm` as the weight. The split
+makes the routing influence of command edges explicit. The separate
+`_adj_local` graph, which keeps UCIe's "0-distance bus" from being
+preferred over the mesh, is consistent with the cross-PE-slice
+mesh-distance requirement of ADR-0017 D7.
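
The four-way classification can be sketched in a single pass over the edges. This is an illustrative reconstruction, not the router.py code: the `Edge` dataclass, the `build_adjacency` name, the graph labels, and the assumption that edges are directed are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

UCIE_KINDS = {
    "ucie_internal", "ucie_conn_to_router", "router_to_ucie_conn",
    "ucie_conn_to_noc", "noc_to_ucie_conn", "ucie_mesh",
    "io_to_cube", "cube_to_io",
}
MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}


@dataclass
class Edge:  # hypothetical minimal edge record
    src: str
    dst: str
    kind: str
    distance_mm: float
    routing_weight_mm: Optional[float] = None


def build_adjacency(edges):
    graphs = {name: defaultdict(list) for name in ("all", "data", "mcpu_dma", "local")}
    for e in edges:
        # `or` falls back to distance_mm when the routing weight is None (or 0).
        weight = e.routing_weight_mm or e.distance_mm
        graphs["all"][e.src].append((e.dst, weight))
        if e.kind != "command":
            graphs["data"][e.src].append((e.dst, weight))
        if e.kind not in MCPU_DMA_EXCLUDE:
            graphs["mcpu_dma"][e.src].append((e.dst, weight))
        if e.kind not in UCIE_KINDS:
            graphs["local"][e.src].append((e.dst, weight))
    return graphs


edges = [
    Edge("m_cpu", "noc", "command", 1.0),
    Edge("pe0.pe_dma", "r0c0", "pe_to_router", 0.5),
    Edge("r0c0", "ucie_e.c0", "router_to_ucie_conn", 2.0, routing_weight_mm=0.1),
]
graphs = build_adjacency(edges)
```

Because membership is decided by exclusion, an edge kind that appears in none of the exclusion sets lands in every graph in this sketch.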
+
+### D3. The 5 public APIs of `PathRouter` (+ 2 backward-compat)
+
+#### D3.1. `find_path(src_pe: str, dst_node: str) -> list[str]`
+
+**PE DMA routing**. `src_pe` is a PE prefix (e.g. `"sip0.cube0.pe0"`); the
+function automatically appends `.pe_dma`, so the actual start node becomes
+`"sip0.cube0.pe0.pe_dma"`.
+
+The adjacency graph is chosen by whether the route is cube-local
+(`_same_cube`):
+
+- **same-cube** (src and dst share the `sip{S}.cube{C}.` prefix): uses
+  `_adj_local`. UCIe shortcuts are blocked, so cross-PE-slice traffic pays
+  the true mesh distance (ADR-0017 D7).
+- **cross-cube**: uses `_adj`. UCIe is included naturally as the best
+  option for a cross-cube path.
+
+#### D3.2. `find_path_with_distance(src_pe, dst_node) -> tuple[list[str], float]`
+
+Uses the same adjacency policy as D3.1 but returns `(path, total_distance)`.
+Used by probes and analysis tools that need the distance metric.
+
+#### D3.3. `find_mcpu_dma_path(m_cpu_id: str, dst_hbm_id: str) -> list[str]`
+
+**M_CPU DMA path**. Uses `_adj_local` when both ends are in the same cube
+(the path stays inside the mesh) and `_adj_all` otherwise (via UCIe).
+`_MCPU_DMA_EXCLUDE` automatically rules out PE pipeline nodes, so no path
+is ever routed through a PE's internal stages.
+
+#### D3.4. `find_memory_path(src: str, dst: str) -> list[str]`
+
+A direct memory path such as `pcie_ep → io_noc → cube → router mesh →
+hbm_ctrl`. Uses `_adj_mcpu_dma`, which excludes `pe_internal` and
+`pe_to_router` edges, so host-issued reads/writes cannot leak into the PE
+pipeline. Called directly by the probe (the H2D/D2H cases of ADR-0049 D1).
+
+#### D3.5. `find_node_path(src: str, dst: str) -> list[str]`
+
+Path between any two nodes, **including command edges** (uses `_adj_all`).
+Called by IoCpuComponent, MCpuComponent, and others that must traverse
+command-kind links such as M_CPU ↔ NOC.
+
+#### D3.6. backward-compat shims
+
+- `_dijkstra(start, goal) -> list[str]`: a thin wrapper over
+  `_run_dijkstra(self._adj, …)`.
+- `_dijkstra_with_dist(start, goal) -> tuple[list[str], float]`: the
+  variant that also returns the distance.
+
+The underscore prefix marks these as internal, but existing tests call them
+directly. New code should use D3.1–D3.5; both shims are deprecation
+candidates.
+
+### D4. Dijkstra algorithm: single-source shortest path
+
+`_run_dijkstra_with_dist(adj, start, goal)`:
+
+- `heapq` priority queue.
+- `best: dict[node, distance]`: best-known distance per node.
+- `prev: dict[node, predecessor]`: for path reconstruction.
+- The weight is `routing_weight_mm or distance_mm`. Some edges, such as
+  UCIe, carry an explicit routing_weight that differs from their distance,
+  so keeping the two apart is intentional.
+
+`start == goal` takes the fast path and returns `([start], 0.0)`. An
+unreachable goal raises `RoutingError(f"no path from {start} to {goal}")`.
+
+The algorithm is **deterministic**: the same graph and start/goal always
+yield the same path. This matches the SPEC R1 requirement that "Routing
+MUST be deterministic". Ties are broken by `heapq` push order (Python list
+order is deterministic).
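
A compact version of the search described above might look like this. It is a sketch under assumptions, not the `_run_dijkstra_with_dist` source: the function name is hypothetical, `LookupError` stands in for `RoutingError`, and the heap holds `(distance, node_id)` tuples, so equal distances are ordered by node id, which may differ from the tie-break in router.py.

```python
import heapq


def shortest_path(adj, start, goal):
    """Dijkstra over adj: dict[node, list[(neighbor, weight)]]."""
    if start == goal:
        return [start], 0.0  # fast path

    best = {start: 0.0}   # best-known distance per node
    prev = {}             # predecessor map for path reconstruction
    heap = [(0.0, start)]

    while heap:
        dist, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in adj.get(node, ()):
            candidate = dist + weight
            # Strict '<' keeps the first-found predecessor on equal distance.
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))

    if goal not in best:
        raise LookupError(f"no path from {start} to {goal}")

    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path, best[goal]


adj = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 1.0)]}
path, dist = shortest_path(adj, "a", "c")
```

Running the same query twice on the same `adj` returns the same path, which is the determinism property the ADR relies on.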
+
+### D5. Single-owner principle for the helper API
+
+The following are decided only inside router.py:
+
+- Naming convention: `sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`,
+  `sip{S}.cube{C}.hbm_ctrl.pe{pe_id}`.
+- Adjacency policy: which edge kinds belong to which graph.
+- How to recover the PE id from the HBM slice size.
+- The Dijkstra weight (`routing_weight_mm or distance_mm`).
+
+If this principle breaks (for example, a component starts building
+`f"sip{s}..."` itself), the blast radius of any naming change explodes.
+This follows the spirit of ADR-0015 D4.
+
+### D6. Helper API consumers
+
+Call sites of the helper methods (as of the current corpus):
+
+- `probes/probe.py` (ADR-0049): `find_pcie_ep`, `find_io_cpu`,
+  `find_m_cpu`, `find_node_path`, `find_mcpu_dma_path`,
+  `find_memory_path`, `find_path`, `resolve`.
+- `runtime_api/distributed.py` (ADR-0047): indirect (routing inside the
+  engine).
+- `ccl/install.py` (ADR-0023): `find_all_pcie_eps`, `resolve`.
+- `sim_engine/event_log.py`: like the probe, `find_pcie_ep`,
+  `find_memory_path`.
+- `components/builtin/m_cpu.py`, `components/builtin/io_cpu.py`:
+  `find_node_path`, `find_mcpu_dma_path`.
+- Various tests (test_routing.py, test_cross_sip_routing.py, etc.): most
+  of D3.1–D3.5.
+
+When a new consumer is added, D1/D3 of this ADR are the first reference for
+deciding whether a suitable method already exists or a new one is needed.
+
+## Alternatives Considered
+
+### A1. Single adjacency graph with a dynamically applied edge-kind filter
+
+Rejected. Re-filtering the graph on every `find_*()` call hurts Dijkstra's
+cache locality and performance. Building four graphs up front (D2) costs
+little memory (on the order of tens of thousands of edges at most), and
+policy selection at call time is O(1).
+
+### A2. Split adjacency by separate metadata instead of edge `kind`
+
+Rejected. The topology builder already assigns edge `kind` (ADR-0015 D4 +
+ADR-0017); introducing separate metadata would create two systems that
+must be kept in sync.
+
+### A3. BFS with uniform weights instead of Dijkstra
+
+Rejected. routing_weight_mm differs per edge (mesh link / UCIe /
+IO-internal), so BFS only minimizes hop count and does not guarantee the
+shortest latency or distance. That conflicts with the deterministic and
+accurate routing required by SPEC R1 + R2.
+
+### A4. Module functions instead of class methods
+
+Rejected. The two classes (`AddressResolver`, `PathRouter`) each hold
+cached state (`_node_ids`, `_hbm_slice_bytes`, 4 adjacency graphs), and
+many routing queries hit the same graph instance. Module functions would
+have to rebuild that state on every call or keep it global, which hurts
+safety and performance.
+
+## Consequences
+
+- If components, probes, IPCQ install, and the runtime API all call only
+  the router.py helpers, a naming change (e.g. `.io0.` → `.iochiplet0.`)
+  is a single-file edit (D5).
+- The four-graph split of D2 is now fixed in an ADR, giving a clear
+  criterion for which graphs should include a newly added edge kind (e.g.
+  a new kind for an inter-die UCIe link).
+- The cube-local vs cross-cube branch of D3.1 (ADR-0017 D7) is explicit,
+  so anyone changing routing behavior knows which adjacency graph to touch.
+- The consumer list of D6 is explicit, which makes the PR review scope of
+  a helper API change clear. The backward-compat shims (D3.6) are
+  identified as deprecation candidates.
@@ -0,0 +1,352 @@
+# ADR-0052: OpLog + MemoryStore Schemas — sim_engine internals
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies the `OpRecord` schema in `sim_engine/op_log.py`, the
+behavior of `OpLogger`'s record_start / record_end / record_copy, and the
+(space, addr) address namespace and read/write semantics used by
+`MemoryStore` in `sim_engine/memory_store.py`. ADR-0020 (2-pass data
+execution) declares that both pieces exist, but **the exact fields of each
+record and their meaning** were never written down at the ADR level, even
+though several ADRs depend on their behavior, including ADR-0046 D3.2
+(`tl.store` visibility) and ADR-0023 D9 (IPCQ copy record).
+
+## First action
+
+### `OpLogger(memory_store=None)`
+
+On construction it initializes three fields:
+
+1. `self._records: list[OpRecord] = []`: the accumulated op records.
+2. `self._pending: dict[int, dict] = {}`: partial records keyed by
+   `id(msg)` (created at record_start, with record_end still outstanding).
+3. `self._memory_store = memory_store`: an optional MemoryStore reference,
+   used to capture input snapshots for math ops and HBM source snapshots
+   for dma_write.
+
+Both records and pending are empty at construction; `record_*` calls
+accumulate data over time.
+
+### `MemoryStore()`
+
+On construction it initializes a single field,
+`self._storage: dict[str, dict[int, np.ndarray]] = {}`. It is a two-level
+dict (`space → addr → ndarray`), and inner dicts are filled in lazily as
+each space is first needed.
+
+In other words, **the first thing both pieces do is set up an empty
+accumulation buffer and an empty per-space sparse dict**. The fields only
+start filling once the first record / write actually arrives.
+
+## Context
+
+D2/D5/D7 of ADR-0020 (2-pass data execution) declare the following:
+
+- During Phase 1 (timing), the `ComponentBase._on_process_start/end` hooks
+  call `OpLogger.record_start/end` to record the time and metadata of
+  every data op.
+- Phase 2 (data) replays the op_log in t_start order to compute the actual
+  data results.
+- The data payloads themselves live in `MemoryStore`, keyed by (space,
+  addr).
+
+Later ADRs such as ADR-0023 D9 (IPCQ atomic write), ADR-0027 (avoiding
+Megatron TP scratch overwrite), and ADR-0046 D3.2 (`tl.store` visibility)
+depend on the behavior of op_log and MemoryStore, yet **the exact record
+fields, space names, and snapshot timing** can only be found by grepping
+the code. This ADR documents them.
+
+## Decision
+
+### D1. `OpRecord` schema: 7 fields
+
+```python
+@dataclass
+class OpRecord:
+    t_start: float
+    t_end: float
+    component_id: str
+    op_kind: str               # "memory" | "gemm" | "math" | "unknown"
+    op_name: str               # e.g. "dma_read", "gemm_f16", "exp",
+                               #     "TileToken/DMA_READ", "composite_gemm",
+                               #     "ipcq_copy"
+    params: dict[str, Any]
+    dependency_ids: list[int] = field(default_factory=list)
+```
+
+- **`t_start` / `t_end`**: SimPy time (float, ns). `t_start` is when the
+  component started the op and `t_end` is when it completed; duration =
+  `t_end - t_start`.
+- **`component_id`**: node id where the op ran (e.g.
+  `"sip0.cube0.pe0.pe_dma"`).
+- **`op_kind`**: one of four values. The Phase 2 DataExecutor branches on
+  it.
+- **`op_name`**: human-friendly name for debugging and analysis. For a
+  TileToken it is `"TileToken/{stage_type}"` (e.g. `"TileToken/DMA_READ"`),
+  which distinguishes the stage.
+- **`params`**: op-specific metadata dict (see D3).
+- **`dependency_ids`**: currently unused (default `[]`). Reserved for
+  cross-op dependency tracking if it becomes necessary.
+
+### D2. `OpLogger.records`: guaranteed t_start ordering
+
+```python
+@property
+def records(self) -> list[OpRecord]:
+    self._records.sort(key=lambda r: r.t_start)
+    return self._records
+```
+
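+Every access performs a stable sort by `t_start`, so records with the same
+t_start keep their insertion order. This matches the "t_start stable
+ordering" requirement of ADR-0020 D5.
+
+The Phase 2 DataExecutor always reads through the `records` property, so
+even when record_end calls arrive in a different order than t_start (e.g.
+a short op starts later than a long op but finishes first), it sees a
+re-sorted, consistent sequence.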
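
The stable-sort guarantee is easy to see on a toy list. The `Rec` dataclass below is a hypothetical two-field stand-in for `OpRecord`; records are appended in completion order and then sorted by start time, as the property does.

```python
from dataclasses import dataclass


@dataclass
class Rec:  # hypothetical stand-in for OpRecord
    t_start: float
    name: str


# Appended in record_end order: the op that started at t=2.0 finished
# between two ops that both started at t=5.0.
records = [Rec(5.0, "long_a"), Rec(2.0, "early"), Rec(5.0, "long_b")]

# list.sort is stable: equal t_start keeps insertion order.
records.sort(key=lambda r: r.t_start)
order = [r.name for r in records]
```

`long_a` stays ahead of `long_b` after sorting because both share `t_start == 5.0` and `long_a` was inserted first.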
+
+### D3. `params` schema per op_name (`_extract_op_info` mapping)
+
+#### D3.1. `op_kind="memory", op_name="dma_read"` (DmaReadCmd)
+
+```python
+{"src_addr": int, "nbytes": int, "handle_id": str}
+```
+
+#### D3.2. `op_kind="memory", op_name="dma_write"` (DmaWriteCmd)
+
+```python
+{
+    "src_space": str,   # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
+    "src_addr": int,    # handle.addr
+    "shape": tuple, "dtype": str,
+    "dst_space": "hbm", # DmaWrite always targets HBM
+    "dst_addr": int,
+    "nbytes": int,
+    "handle_id": str,
+    # snapshot is attached at record_end when src_space == "hbm" (D4)
+    "snapshot": np.ndarray | None,
+}
+```
+
+#### D3.3. `op_kind="gemm", op_name=f"gemm_{dtype_a}"` (GemmCmd)
+
+```python
+{
+    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
+    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
+    "dtype_in": str, "dtype_out": str,
+    "m": int, "k": int, "n": int,
+    # ADR-0027: preserve per-operand + output spaces
+    "src_a_space": str, "src_b_space": str, "dst_space": str,
+}
+```
+
+#### D3.4. `op_kind="math", op_name=msg.op` (MathCmd; op = "exp", "sum", "add", "where", etc.)
+
+```python
+{
+    "input_addrs": list[int],   # addr of each input handle
+    "input_shapes": list[tuple],
+    "input_spaces": list[str],
+    "input_dtypes": list[str],
+    "dst_addr": int, "dst_space": str,
+    "shape_out": tuple, "dtype": str,
+    "axis": int | None,         # only meaningful for reductions
+    # snapshots of all inputs are filled in at record_end (D4)
+    "input_snapshots": list[np.ndarray | None],
+}
+```
+
+#### D3.5. `op_kind="gemm" or "math", op_name=f"composite_{op}"` (CompositeCmd)
+
+```python
+{
+    "op": str,              # "gemm" | "math"
+    "out_addr": int, "out_nbytes": int,
+    # when op == "gemm", the same fields as GemmCmd are added:
+    "src_a_addr": int, "src_b_addr": int,
+    "shape_a": tuple, "shape_b": tuple,
+    "dtype_in": str, "dtype_out": str,
+    "src_a_space": str, "src_b_space": str,
+    "dst_space": "hbm", "dst_addr": int,  # = out_addr
+}
+```
+
+If `op == "gemm"` then `op_kind = "gemm"`, otherwise `"math"`. This aliasing
+lets Phase 2 replay the record through the same path as GemmCmd.
+
+#### D3.6. `op_kind="memory", op_name="ipcq_copy"` (record_copy-only path)
+
+```python
+{
+    "src_space": str, "src_addr": int,
+    "dst_space": str, "dst_addr": int,
+    "shape": tuple, "dtype": str, "nbytes": int,
+    "snapshot": np.ndarray | None,   # passed by the caller; otherwise record_copy does a fresh read
+}
+```
+
+`PE_DMA._handle_ipcq_inbound` (ADR-0023 D9) emits this record so that Phase
+2 can replay the inbound copy of an IPCQ slot. The record does not go
+through `record_start` / `record_end`; it is pushed directly by
+`record_copy()`.
+
+#### D3.7. `op_kind="unknown", op_name=type(msg).__name__`
+
+Fallback for messages that `_extract_op_info` does not recognize. params =
+`{}`. The DataExecutor skips this op_kind, so it has no effect on Phase 2
+replay.
+
+### D4. When snapshots are captured
+
+When `OpLogger._memory_store` is set, record_end does the following:
+
+- **math op**: calls `self._memory_store.read(...)` with each input's
+  addr/shape/space/dtype and attaches ndarray copies to
+  `params["input_snapshots"]`. A failed read yields None.
+- **dma_write op**: attaches a snapshot of the source HBM to
+  `params["snapshot"]` only when `src_space == "hbm"`. TCM sources are
+  **explicitly skipped**: TCM (PE scratch) is refilled by Phase 2
+  math/gemm replay, so a Phase-1-time snapshot risks capturing stale data
+  from an earlier kernel (ADR-0027 postmortem: TP gemm → all_reduce race).
+- **ipcq_copy**: the `record_copy` caller passes an in-flight snapshot such
+  as `snapshot=token.data`. Without one, record_copy falls back to a fresh
+  read.
+
+Snapshots are taken with `.copy()` (`ndarray.copy()` is a fresh
+allocation), so they are safe from later storage mutation. This is the
+basis of ADR-0027's "cross-PE Phase 2 ordering" race avoidance.
+
+When `memory_store` is None (Phase 1 timing-only mode), every snapshot step
+is skipped. Only the timing information of each record is kept, and data
+replay is not possible.
+
+### D5. TileToken handling: record_start captures the stage information
+
+For the self-routing tile token of ADR-0014 D6 (pipeline mode), stage_idx
+may already have advanced by the time record_end runs (the TileToken
+caches the next stage's params as it moves to the next component).
+Therefore:
+
+`record_start` stores the following in `pending[id(msg)]["snap"]` up front:
+
+```python
+snap["stage_type"] = stage.stage_type.name        # "DMA_READ", "GEMM", etc.
+snap["stage_params"] = dict(stage.params)         # copy of the params at this point
+```
+
+`record_end` takes this snap and merges it into params:
+
+- `params["stage_type"]` is added to the final params.
+- The keys of `stage_params` are merged in (existing keys are preserved).
+- If `op_name == "TileToken"`, it is rewritten to
+  `op_name = f"TileToken/{stage_type}"` (e.g. `"TileToken/DMA_READ"`),
+  which disambiguates records of different stages from the same component.
+
+Thanks to this mechanism, reporting can tell DMA_READ from DMA_WRITE and
+FETCH from STORE even when they occur on the same component (e.g. pe_dma).
+
+### D6. `MemoryStore`: two-level dict keyed by (space, addr)
+
+```python
+import numpy as np
+
+class MemoryStore:
+    def __init__(self) -> None:
+        self._storage: dict[str, dict[int, np.ndarray]] = {}
+
+    def write(self, space, addr, data):
+        # Create the inner dict lazily on the first write to a new space (D6.1).
+        self._storage.setdefault(space, {})[addr] = data
+    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
+    def has(self, space, addr) -> bool: ...
+    def snapshot(self) -> "MemoryStore": ...
+```
+
+#### D6.1. space namespace
+
+String keys. Standard values:
+
+- `"hbm"`: HBM data (deploy_tensor + Phase 2 dma_write results).
+- `"tcm"`: PE-local TCM (Phase 2 math/gemm results).
+- `"sram"`: cube-level SRAM (ADR-0023 D9.7 IPCQ slot tier).
+
+Other spaces (e.g. `"reg"`) are also allowed: `_storage` is a lazy dict, so
+a new space is created automatically on its first write.
+
+#### D6.2. address keying
+
+`addr` is an integer. It may be either a **physical address (PA) or a
+virtual address (VA)**; MemoryStore knows nothing about address-space
+semantics and simply uses it as a key. The Phase 1 `MemoryWriteMsg` writes
+both PA and VA (`_create_tensor` zero-inits at the PA and again at the VA
+base), and Phase 2 reads/writes at the addresses captured in the op_log.
+
+The caller decides what `addr` means; `MemoryStore` only provides lookup.
+
+#### D6.3. read/write semantics: reference store (no copy)
+
+`write(space, addr, data)`: stores a reference to the `data` ndarray. **No
+copy is made**. If the caller later mutates the same ndarray, the stored
+value changes too.
+
+`read(space, addr, shape=None, dtype=None)`: returns a reference to the
+stored ndarray. When `shape` or `dtype` is given:
+
+- `dtype != stored.dtype`: reinterpret cast via `arr.view(np_dtype)` (no
+  copy).
+- `shape != stored.shape`: `arr.reshape(shape)` (a view) when `nbytes`
+  matches.
+- `nbytes` mismatch: `ValueError`.
+
+To isolate data safely, the caller must call `arr.copy()`. This is why
+ADR-0027's race avoidance forces an explicit copy at the op_log snapshot
+step.
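
The reference semantics can be demonstrated with plain NumPy. The snippet below uses a bare nested dict as a hypothetical stand-in for `MemoryStore._storage`; it only illustrates the aliasing behavior and is not the store's implementation.

```python
import numpy as np

store = {}  # hypothetical stand-in for MemoryStore._storage
data = np.zeros(4, dtype=np.float32)

# write(): the store keeps a reference, not a copy.
store.setdefault("tcm", {})[0x100] = data
data[0] = 7.0                       # caller mutation is visible in the store
stored = store["tcm"][0x100]

# read() with dtype / shape: reinterpret and reshape are views (no copy).
as_u32 = stored.view(np.uint32)     # same bytes, different dtype
as_2x2 = stored.reshape((2, 2))     # same bytes, different shape

# Explicit copy is the only way to decouple from later mutation.
snapshot = stored.copy()
data[1] = 9.0
```

After the last line the stored array and both views see the `9.0`, while `snapshot` still holds the value from the moment it was copied.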
+
+#### D6.4. `has(space, addr) -> bool`
+
+Only checks whether the key exists. It does not instantiate any data.
+
+#### D6.5. `snapshot() -> MemoryStore`
+
+Shallow copy: creates new inner dict instances but shares the ndarray
+references. Used at Phase 2 initialization to fork the Phase 1 store, so
+that Phase 2 mutations do not affect other users of the Phase 1 store.
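
What a shallow fork does and does not isolate can be shown on a nested dict. The `fork` helper is hypothetical and only mirrors the described behavior (new inner dicts, shared ndarrays); it is not the `snapshot()` source.

```python
import numpy as np


def fork(storage):
    """Shallow fork: new outer and inner dicts, shared ndarray references."""
    return {space: dict(inner) for space, inner in storage.items()}


phase1 = {"hbm": {0: np.arange(4)}}
phase2 = fork(phase1)

# Rebinding or adding a key in the fork does not touch the original.
phase2["hbm"][64] = np.ones(2)

# In-place mutation of a shared ndarray is visible on both sides.
phase2["hbm"][0][0] = 99
```

Under this reading, isolation holds for writes that store a new array at a key, while in-place edits of an existing array still need an explicit `.copy()`.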
+
+### D7. op_log assumes single-threaded SimPy
+
+`OpLogger` uses `_records` and `_pending` without locks. This relies on
+SimPy being single-threaded, so no other thread can interleave between
+`record_start` and `record_end`.
+
+If multi-process kernbench (ADR-0047 D6) is introduced, OpLogger must be
+split per process: a single OpLogger instance cannot receive records from
+multiple processes.
+
+## Alternatives Considered
+
+### A1. Move op_log to an external store such as SQLite / parquet
+
+Rejected (for now). An in-memory list minimizes the Phase 1 → Phase 2
+handoff latency. Externalizing could pay off for long-running batch runs,
+but for the current single-run workloads it only adds overhead.
+
+### A2. Capture snapshots at record_start
+
+Rejected. At record_start the inputs are often not filled in yet (e.g.
+when a math op's input is the output of the immediately preceding op).
+record_end is the correct point.
+
+### A3. Split MemoryStore into per-component stores
+
+Rejected. The (space, addr) key already disambiguates sufficiently, and a
+per-component split would complicate cross-PE IPCQ copy (ADR-0023 D9),
+which needs access to both the source and destination stores.
+
+### A4. Record explicit cross-op dependency edges in op_log
+
+Partially adopted. The `dependency_ids` field exists on OpRecord but is
+currently unused (D1). The Phase 2 DataExecutor determines ordering by
+t_start plus a secondary sort (memory ops before math at the same
+t_start); if an explicit dependency graph is ever needed, this field is
+where it goes. The ordering rule is sufficient today, so the field stays
+unused.
+
+## Consequences
+
+- The op_log / MemoryStore declaration of ADR-0020 is expanded into the
+  concrete schemas of D1–D6, so the exact field semantics are available in
+  an ADR, without grepping, when writing or modifying the Phase 2
+  DataExecutor.
+- The per-op_name params schemas of D3 are explicit, so it is clear where
+  in `_extract_op_info` a new op (e.g. a new reduction type) belongs.
+- The snapshot differences of D4 (math = input snapshots, dma_write =
+  HBM-only snapshot) are fixed in an ADR, so the cross-PE race avoidance
+  of ADR-0027 will not silently break in a future refactor.
+- The reference-store semantics of D6.3 are explicit, so callers know they
+  own mutation safety. This justifies ADR-0027's explicit `.copy()`
+  pattern.
+- The single-thread assumption of D7 is explicit, so it is clear that
+  introducing multi-process kernbench (a supersession candidate for
+  ADR-0047 D6) requires splitting OpLogger.
@@ -0,0 +1,307 @@
+# ADR-0053: Topology Builder + Visualizer Algorithms
+
+## Status
+
+Accepted (2026-05-22).
+
+This ADR specifies the key algorithm choices (placement-driven router
+attachment, mesh auto-layout, source_hash cache, view projection, SVG
+rendering) of the topology compile and visualization pipeline implemented
+jointly by `topology/builder.py`, `topology/mesh_gen.py`, and
+`topology/visualizer.py`. ADR-0006 defines the high-level intent of
+topology compilation (compiled topology, distance extraction, automatic
+diagram generation), but **which algorithms the builder actually uses**
+could only be found by grepping the code.
+
+## First action
+
+A call to `resolve_topology(path_str)` runs the following 4 steps in order:
+
+1. **Path validation** (`builder.py::resolve_topology`):
+   `Path(path_str).expanduser().resolve()`, then an existence check and an
+   is-file check. Raises `FileNotFoundError` or `ValueError` on failure.
+2. **YAML parsing** (`_read_spec`): `yaml.safe_load`. A parse error becomes
+   a `ValueError` that includes line/column information. Anything that is
+   not a dict is rejected.
+3. **Mesh auto-generation** (`mesh_gen.ensure_mesh_file`): creates
+   `cube_mesh.yaml` in the same directory as the topology yaml (when the
+   cache is invalid) or reuses it (on a cache hit). This step determines
+   the router grid and attachment information of the cube NoC.
+4. **Graph compilation** (`_compile_graph`): accumulates nodes/edges in the
+   order system → IO chiplets → cubes → inter-cube edges → IO↔cube edges →
+   system↔IO edges, then builds 4 view projections (system, sip, cube, pe)
+   and bundles everything into a `TopologyGraph`.
+
+In other words, **the first thing topology compile does is read
+topology.yaml into a dict, create or validate cube_mesh.yaml in the same
+directory, and then build the flat graph plus the 4-view projection in
+system→sip→cube→pe order**.
+
+## Context
+
+Responsibilities of the `topology/` package:
+
+- **builder.py** (1207 lines): compiles topology.yaml into a
+  `TopologyGraph` (nodes + edges + 4 view projections).
+- **mesh_gen.py** (305 lines): automatically decides the cube NoC router
+  grid and the PE/UCIe/M_CPU/SRAM attachment positions, and caches them as
+  `cube_mesh.yaml`.
+- **visualizer.py** (887 lines): generates 4 SVG diagrams (system / sip /
+  cube / pe) from a `TopologyGraph`.
+
+ADR-0006 makes the high-level decision that "the result of topology
+compilation is the single source for distance metadata and diagram
+generation", but the concrete algorithms (e.g. placement-driven
+nearest-router attachment, HBM exclusion zone computation, which
+source_hash fields trigger invalidation) are not in any ADR.
+
+In particular, the following decisions are missing at the ADR level:
+
+- Why is the mesh_gen output cached in a separate file (`cube_mesh.yaml`)?
+- Which fields does source_hash cover, and which changes force
+  regeneration?
+- Why are placement coordinates in mm rather than cube coordinates?
+- How are the HBM exclusion zone and the UCIe N/S/E/W distribution decided
+  inside the mesh?
+- How do the 4 view projections (system/sip/cube/pe) differ in abstraction
+  level?
+
+This ADR collects these decisions in one place.
+
+## Decision
+
+### D1. Compile pipeline: 6 build stages
+
+`_compile_graph(spec)`:
+
+1. **System node creation** (`_instantiate_system`): adds system-level
+   nodes such as `fabric.switch0` and the host CPU.
+2. **per-SIP loop** (`for sip_id in range(system.sips.count)`):
+   - **IO chiplets** (`_instantiate_io_chiplets`): creates the pcie_ep /
+     io_cpu / io_noc / io_ucie PHY / conn nodes and their internal
+     bidirectional edges.
+   - **cube instantiation** (`_instantiate_cube`): based on the router grid
+     in cube_mesh.yaml, instantiates the per-cube routers, PE
+     sub-components (pe_cpu, pe_dma, pe_fetch_store, pe_gemm, pe_math,
+     pe_mmu, pe_tcm, pe_scheduler, pe_ipcq), m_cpu, sram, and hbm_ctrl,
+     and lays down the internal edges.
+   - **inter-cube edges** (`_add_inter_cube_edges`): UCIe N/S/E/W mesh
+     edges.
+   - **IO ↔ cube edges** (`_add_io_to_cube_edges`): connects io_noc to the
+     edge UCIe PHYs of the cubes.
+3. **switch ↔ IO edges** (`_add_system_to_io_edges`): bidirectional edges
+   between `fabric.switch0` and each SIP's `pcie_ep` (ADR-0038 D3 + the
+   cross-SIP IPCQ path of ADR-0010).
+4. **view projections**: builds 4 of them:
+   - `_build_system_view(spec)`: tray level, the SIPs and the system
+     switch.
+   - `_build_sip_view(spec)`: the cube mesh + IO chiplet inside a SIP.
+   - `_build_cube_view(spec)`: the router grid + PE/M_CPU/SRAM/HBM_CTRL
+     attachments inside a single cube.
+   - `_build_pe_view(spec)`: the 9 sub-components + internal edges inside
+     a single PE.
+5. **Return the TopologyGraph**: `TopologyGraph(spec, nodes, edges,
+   system_view, sip_view, cube_view, pe_view)`.
+
+The order of the six node/edge build stages (system nodes, IO chiplets,
+cubes, inter-cube edges, IO ↔ cube edges, switch ↔ IO edges) **is
+significant**: inter-cube edges only have valid src/dst once the cubes
+exist, and the IO chiplets must be created first so that IO ↔ cube edges
+can reference them. A new node type must be inserted at a position that
+respects these dependencies.
+
+### D2. `cube_mesh.yaml`: separate file + source_hash cache
+
+`mesh_gen.ensure_mesh_file(cube_spec, mesh_path)`:
+
+1. Computes `source_hash = _compute_source_hash(cube_spec)`. Input fields:
+   - `geometry` (cube_mm.w/h, etc.).
+   - `pe_layout` (corners, pe_per_corner).
+   - `ucie.n_connections`.
+   - `memory_map.hbm_mapping_mode`.
+   - `placement` (m_cpu/sram pos_mm).
+2. If `mesh_path` (= `cube_mesh.yaml` in the same directory as
+   `topology.yaml`) exists and `existing.source_hash == source_hash`, it
+   is reused (cache hit).
+3. Otherwise `_generate_mesh(cube_spec, source_hash)` generates a new mesh
+   and saves it as yaml.
+
+Reasons for caching in a separate file:
+
+- Mesh generation involves PE/UCIe/router attachment computation and is
+  too heavy to redo every time.
+- Repeated runs with the same cube spec must be guaranteed the same mesh.
+- It becomes an artifact that a person can inspect and debug directly.
+
+The 5 fields listed for `source_hash` are the ones that determine the mesh
+shape; changes to anything else (e.g. bandwidth, overhead_ns) do not
+trigger mesh regeneration.
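
The invalidation rule can be sketched as a hash over only the shape-determining fields. The ADR does not state the hash algorithm or serialization, so SHA-256 over sorted-key JSON is an assumption here, and `compute_source_hash` is a hypothetical name rather than the mesh_gen function.

```python
import hashlib
import json


def compute_source_hash(cube_spec: dict) -> str:
    # Only the five fields that determine the mesh shape take part.
    relevant = {
        "geometry": cube_spec.get("geometry"),
        "pe_layout": cube_spec.get("pe_layout"),
        "ucie_n_connections": cube_spec.get("ucie", {}).get("n_connections"),
        "hbm_mapping_mode": cube_spec.get("memory_map", {}).get("hbm_mapping_mode"),
        "placement": cube_spec.get("placement"),
    }
    # sort_keys makes the serialization independent of dict insertion order.
    blob = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


base = {
    "geometry": {"cube_mm": {"w": 12.0, "h": 12.0}},
    "pe_layout": {"corners": ["NW", "NE", "SW", "SE"], "pe_per_corner": 2},
    "ucie": {"n_connections": 4, "bw_gbs": 64},
    "memory_map": {"hbm_mapping_mode": "per_pe", "hbm_total_gb": 48},
    "placement": {"m_cpu": {"pos_mm": [1.5, 5.5]}, "sram": {"pos_mm": [1.5, 8.5]}},
}
bw_only = {**base, "ucie": {"n_connections": 4, "bw_gbs": 128}}
more_links = {**base, "ucie": {"n_connections": 6, "bw_gbs": 64}}
```

A bandwidth-only change leaves the hash untouched (cache hit), while changing `n_connections` produces a new hash and forces regeneration. The field values in `base` are made up for the example.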
+
+### D3. Cube NoC mesh auto-layout algorithm
+
+`_generate_mesh(cube_spec)`:
+
+#### D3.1. Deciding rows and columns
+
+- `pe_positions = _corner_pe_positions(cube_w, cube_h)`: PE center
+  coordinates (mm) for each of the 4 corners (NW/NE/SW/SE). Uses the
+  hardcoded `(1.5, 1.5)` / `(cube_w-1.5, cube_h-1.5)` pattern; with
+  `pe_per_corner=2` each corner gets 2 PE positions.
+- `col_xs = _compute_col_positions(...)`: the union of the PE x
+  coordinates, with relay columns inserted into any gap larger than
+  `max_spacing = 3.0 mm`.
+- `row_ys, rows_per_half = _compute_row_positions(cube_h, n_connections,
+  pe_positions)`:
+  - `n_conn = max(n_connections, 2)` (hot path minimum).
+  - `rows_per_half = ceil(n_conn / 2)`.
+  - Top half + two HBM rows + bottom half. The HBM rows sit at
+    `(cube_h/2 - 1.5, cube_h/2 + 1.5)`, with `hbm_gap = 1.5 mm` between
+    the PE rows and the HBM rows.
+
+#### D3.2. HBM exclusion zone
+
+`hbm_row_start = rows_per_half`, `hbm_row_end = rows_per_half + 1`.
+`hbm_col_start = n_cols // 2 - 1`, `hbm_col_end = n_cols // 2`.
+
+Router slots inside this (row, col) rectangle are marked `None` (no
+router). The actual HBM controllers are attached as separate
+`hbm_ctrl.pe{X}` nodes following the per-PE partition pattern of ADR-0017
+D9.
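
The index arithmetic for the exclusion zone can be checked with a small grid. This sketch assumes the start/end indices are inclusive on both ends; the function names and the `r{R}c{C}` placeholder labels are illustrative, not taken from mesh_gen.py.

```python
import math


def hbm_exclusion_zone(n_connections: int, n_cols: int):
    """Return ((row_start, row_end), (col_start, col_end)), inclusive."""
    n_conn = max(n_connections, 2)
    rows_per_half = math.ceil(n_conn / 2)
    row_range = (rows_per_half, rows_per_half + 1)
    col_range = (n_cols // 2 - 1, n_cols // 2)
    return row_range, col_range


def router_grid(n_rows: int, n_cols: int, row_range, col_range):
    # Start with a router in every slot, then blank out the HBM rectangle.
    grid = [[f"r{r}c{c}" for c in range(n_cols)] for r in range(n_rows)]
    for r in range(row_range[0], row_range[1] + 1):
        for c in range(col_range[0], col_range[1] + 1):
            grid[r][c] = None
    return grid


rows, cols = hbm_exclusion_zone(n_connections=4, n_cols=6)
# 2 PE rows per half + 2 HBM rows = 6 rows in total.
grid = router_grid(n_rows=6, n_cols=6, row_range=rows, col_range=cols)
empty_slots = sum(slot is None for row in grid for slot in row)
```

With 4 connections and 6 columns the zone is the central 2×2 block of the 6×6 grid, leaving 32 router slots.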
+
+#### D3.3. PE attachment
+
+The PEs of each corner map to the following rows:
+
+- Top half: NW → row 0, NE → row 1 (index within top_corners).
+- Bottom half: SW → row `hbm_row_end + 1`, SE → row `hbm_row_end + 2`.
+
+Each PE attaches to the router in the column nearest to its x coordinate
+(`min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))`). Three entries
+are attached, `pe{pe_idx}.dma`, `pe{pe_idx}.cpu`, and `pe{pe_idx}.hbm`
+(pushed onto the per-router attach list).
+
+#### D3.4. M_CPU / SRAM attachment: nearest router by Euclidean distance
+
+From the coordinates `placement.m_cpu.pos_mm` (default `[1.5, 5.5]`) and
+`placement.sram.pos_mm` (default `[1.5, 8.5]`), the nearest router by
+Euclidean distance is found and `"m_cpu"` / `"sram"` is added to its
+attach list.
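
A nearest-router lookup of this kind can be sketched as below. The function name, the `(row, col)` return convention, the `excluded` parameter for empty HBM slots, and the grid coordinates are assumptions made for the example; only the default `pos_mm` values come from the text.

```python
import math


def nearest_router(pos_mm, col_xs, row_ys, excluded=frozenset()):
    """Return the (row, col) of the router closest to pos_mm = (x, y)."""
    candidates = [
        (r, c)
        for r in range(len(row_ys))
        for c in range(len(col_xs))
        if (r, c) not in excluded  # skip slots without a router
    ]
    # min() keeps the first candidate on a tie, so the result is deterministic.
    return min(
        candidates,
        key=lambda rc: math.hypot(col_xs[rc[1]] - pos_mm[0], row_ys[rc[0]] - pos_mm[1]),
    )


col_xs = [1.5, 4.5, 7.5]   # illustrative column x positions (mm)
row_ys = [1.5, 4.5, 7.5]   # illustrative row y positions (mm)

m_cpu_slot = nearest_router((1.5, 5.5), col_xs, row_ys)
sram_slot = nearest_router((1.5, 8.5), col_xs, row_ys)
```

On this 3×3 grid the default M_CPU position lands on row 1 and the default SRAM position on row 2, both in column 0.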
+
+#### D3.5. UCIe N/S/E/W distribution
+
+`ucie_pe_rows = top_pe_rows + bot_pe_rows` (`2 * rows_per_half` rows in
+total).
+
+- UCIe-E: `ucie_e.c{i}` on the rightmost-column router of each PE row.
+- UCIe-W: `ucie_w.c{i}` on the leftmost-column router (mirror of E).
+- UCIe-N/S: the PE columns are split half left, half right, and attached
+  to the matching columns of the top row / bottom row.
+
+Each UCIe connection carries a `c{i}` index, so ucie_n_connections PHYs
+are spread out (ADR-0017 D5+).
+
+### D4. Node naming convention: single owner
+
+builder.py creates nodes under the following naming convention (the
+single-owner principle of ADR-0051 D5):
+
+- `fabric.switch0`: system-level switch.
+- `sip{S}.{io_id}.{pcie_ep|io_cpu|io_noc|io_ucie.{dir}|conn.{id}}`: IO
+  chiplet.
+- `sip{S}.cube{C}.{m_cpu|sram|hbm_ctrl.pe{X}|noc.r{R}c{C}|...}`: cube internals.
+- `sip{S}.cube{C}.pe{P}.{pe_cpu|pe_dma|pe_fetch_store|pe_gemm|pe_math|pe_mmu|pe_tcm|pe_scheduler|pe_ipcq}`: PE sub-components.
+
+Changing this convention requires updating both builder.py and the
+router.py helpers (ADR-0051) together. Components do not know the naming
+convention; they only call the helpers.
+
+### D5. Edge `kind` classification
+
+The `kind` assigned to each edge is the input to the routing policy
+(ADR-0051 D2). Main kind values:
+
+- `"pe_internal"`: between sub-components inside a PE.
+- `"pe_to_router"`: PE_DMA ↔ cube NoC router.
+- `"router_mesh"`: between cube NoC routers.
+- `"router_to_hbm"`, `"router_to_mcpu"`, `"router_to_sram"`,
+  `"sram_to_router"`, etc.: to and from cube-attached components.
+- `"ucie_internal"`, `"ucie_conn_to_router"`, `"router_to_ucie_conn"`,
+  `"ucie_conn_to_noc"`, `"noc_to_ucie_conn"`, `"ucie_mesh"`: UCIe-related.
+- `"io_internal"`: inside an IO chiplet.
+- `"io_to_cube"`, `"cube_to_io"`: IO ↔ cube boundary.
+- `"pcie"`: switch ↔ pcie_ep.
+- `"command"`: control-plane only edges (M_CPU ↔ NOC, etc.; excluded from
+  the PE DMA path).
+
+Adding a new edge kind requires deciding which category of router.py's 4
+adjacency graphs (ADR-0051 D2) it belongs to. Otherwise it is included
+only in `_adj_all` by default, which can produce unintended routing.
+
+### D6. View projection: 4 abstraction levels
+
+Besides the flat representation (nodes + edges), `TopologyGraph` holds 4
+view projections:
+
+- **system_view** (`_build_system_view`): tray level. SIP boxes +
+  `fabric.switch0`, with PCIE links shown. A high-level overview for
+  external presentations.
+- **sip_view** (`_build_sip_view`): inside one SIP. Cube mesh + IO chiplet
+  (pcie_ep + io_cpu + io_noc). UCIe N/S/E/W appears as inter-cube
+  connections.
+- **cube_view** (`_build_cube_view`): inside one cube. Router grid +
+  PE/M_CPU/SRAM/HBM_CTRL attachments + the UCIe PHY edge portion. For
+  diagnosing intra-cube routing and placement.
+- **pe_view** (`_build_pe_view`): inside one PE. 9 sub-components +
+  internal edges (pe_internal kind). For detailed review of PE-internal
+  dataflow.
+
+Views are emitted selectively through the spec, e.g.
+`visualization.emit_views: [system, sip, cube]` (ADR-0006). The pe view is
+left out of the default output, but its code is kept (for detailed
+debugging).
+
+### D7. visualizer.py: SVG diagram output
+
+`emit_diagrams(graph, out_dir)` renders every view to SVG. Key functions:
+
+- `_render_view_svg(view)`: generic view rendering (for views without a
+  router grid).
+- `_render_cube_view_svg(view, spec)`: cube view only (draws the HBM
+  block, router grid layout, PE/M_CPU/SRAM/HBM positioning).
+- `_draw_node`, `_draw_edge`: visual representation of nodes / edges.
+- `_pick_scale`, `_compute_node_sizes`: automatic scaling.
+
+The visualizer is classified as a **derived artifact** (ADR-0006) and is
+not subject to the production check when its code changes. This matches
+the "Derived Artifacts" section of CLAUDE.md.
+
+### D8. Impact of spec changes
+
+| spec field                            | affects                  | mesh regenerated |
+|---------------------------------------|--------------------------|------------------|
+| `system.sips.count`                   | SIP count, node count    | No               |
+| `sip.cube_mesh.w/h`                   | cube mesh shape          | No               |
+| `cube.geometry.cube_mm.w/h`           | cube size (mm)           | **Yes**          |
+| `cube.pe_layout.corners/pe_per_corner`| PE attachment positions  | **Yes**          |
+| `cube.ucie.n_connections`             | UCIe PHY distribution    | **Yes**          |
+| `cube.memory_map.hbm_mapping_mode`    | HBM distribution mode    | **Yes**          |
+| `cube.placement`                      | M_CPU/SRAM position      | **Yes**          |
+| `cube.memory_map.*` (other than above)| HBM capacity / BW        | No               |
+| `*.links.*.bw_gbs`                    | edge bandwidth           | No               |
+| `*.attrs.overhead_ns`                 | component latency        | No               |
+
+The table matches the inputs of `_compute_source_hash` in D2. Any change
+that requires mesh regeneration automatically invalidates the source_hash
+of `cube_mesh.yaml`.
+
+## Alternatives Considered
+
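+### A1. Regenerate the mesh on every compile, without a cache file
+
+Rejected. Every case that is invoked repeatedly with the same spec (CLI
+run, probe, test) would pay the mesh generation cost again. It would also
+remove the artifact that lets a person inspect the mesh.
+
+### A2. Merge mesh generation into builder.py
+
+Rejected (for now). It is a 305-line algorithm of its own, and the mesh
+layout decisions (placement-driven router attachment, HBM exclusion zone)
+differ from the builder's general node/edge creation responsibility.
+Keeping them separate fits the single-responsibility principle better.
+
+### A3. Express placement coordinates in cube coordinates (col/row)
+
+Rejected. mm coordinates are used consistently by both the visualizer and
+the mesh layout (nearest-router computation). Cube coordinates are
+undefined until the router grid is decided, which makes them unsuitable as
+placement input.
+
+### A4. Build view projections lazily
+
+Rejected (for now). Building the 4 views is cheap (typically < 100 ms),
+and eager construction guarantees a single source of truth through
+`TopologyGraph`.
+
+### A5. Support visualizer output formats other than SVG (PNG/PDF)
+
+Rejected. SVG is the most flexible format: vector, text-searchable, and
+directly renderable in a browser. If PNG is needed, post-process with a
+separate tool (rsvg-convert, etc.).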
+
+## Consequences
+
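+- The high-level intent of ADR-0006 is made concrete in D1–D7, and the
+  impact of a topology change can be gauged quickly from the D8 table.
+- The mesh auto-layout algorithm of D3 is fixed at the ADR level, so when
+  a new PE attachment pattern (e.g. a 6-zone HBM split) is introduced, it
+  is clear which steps are affected.
+- The edge kind list of D5 and the view structure of D6 are explicit, so a
+  PR reviewer can see at a glance how far a new component type reaches
+  (builder + router + visualizer).
+- The source_hash invalidation rule of D2 is explicit, so it is clear that
+  a cube_mesh.yaml left untouched after a non-shape change (e.g. only a bw
+  value changed) is normal behavior.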
@@ -0,0 +1,138 @@
+# ADR-0054: Milestone eval bench, a self-contained sweep + figure bench
+
+## Status
+
+Accepted (2026-05-22).
+
+Amends ADR-0044 (D1/D2) and ADR-0045 (D5), and supersedes the ADR-0043/0044
+arrangement in which "the logic lives in `scripts/` + `tests/`": the
+GEMM/allreduce evaluation harness is now a self-contained **bench** that
+the user runs to regenerate all results and figures.
+
+## Context
+
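+ADR-0043 (allreduce evaluation) and ADR-0044 (GEMM evaluation) split each harness into
+a **sweep** (a manual `scripts/` driver or, for allreduce, the parametrized tests
+themselves) plus **figure tests** that render committed data. The sweep/render
+logic therefore lived in `scripts/gemm_sweep.py`,
+`tests/gemm/_gemm_plot_helpers.py`, and
+`tests/sccl/_allreduce_helpers.py`.
+
+The milestone requirement ("refactor the allreduce + GEMM evaluation so that a user runs
+*one bench* to produce all results and plots") cannot be met with that arrangement: a
+bench is production code and **cannot import `tests/`** (ADR-0007 layer direction). The
+evaluation logic had to move into production code so that a bench can reach it.
+
+The chosen home is the bench module itself rather than a separate `kernbench.eval` package.
+A bench file may contain arbitrary module-level code, and folding the harness into the bench
+keeps one file per domain and removes one package layer.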
+
+## Decision
+
+### D1. Two milestone benches own the evaluation logic
+
+- `src/kernbench/benches/milestone_1h_gemm.py`: GEMM shape×variant sweep
+  plus three figure renderers (moved from `scripts/gemm_sweep.py` +
+  `tests/gemm/_gemm_plot_helpers.py`).
+- `src/kernbench/benches/milestone_1h_ccl.py`: distributed allreduce driver,
+  latency + buffer-kind sweeps, topology diagram, FSIM comparison, and the
+  direct-launch parity reference (moved from `tests/sccl/_allreduce_helpers.py`).
+
+Each file is the **single home** of its domain's evaluation logic.
+
+### D2. The "eval bench" pattern (extends ADR-0045 D5)
+
+ADR-0045 D5 pinned a bench to a single configuration (single-SIP, or the ADR-0024
+multi-SIP CCL exception). This ADR adds a third pattern:
+
+- An **eval bench** may drive *many* configurations and render figures. Instead of using
+  the outer `run_bench` engine, it builds its own `GraphEngine` /
+  `RuntimeContext` for each sweep point.
+- No handle is then submitted to the outer ctx, so the bench ends by submitting a
+  **sentinel tensor** (`torch.zeros((1, 1), …)`) to satisfy `run_bench`'s
+  "submit at least once" contract (ADR-0045 D4) and let the CLI exit with
+  status 0.
+
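The sentinel pattern in D2 can be sketched as follows. `FakeContext` and `simulate_point` are hypothetical stand-ins for `RuntimeContext` and a per-point engine build; only the shape of the control flow is meant to match the text.

```python
class FakeContext:
    """Hypothetical stand-in for RuntimeContext; it only records submissions."""

    def __init__(self):
        self._submitted = []

    def zeros(self, shape):
        self._submitted.append(("zeros", shape))
        return shape

    def handles(self):
        return list(self._submitted)


def simulate_point(config):
    """Placeholder for building a private engine/context and running one config."""
    return {"config": config, "latency_ns": 100 * config}


def run(torch):
    # Each sweep point uses its own engine, so nothing reaches the outer ctx.
    results = [simulate_point(cfg) for cfg in (1, 2, 4)]
    # Sentinel submission keeps the outer "submit at least once" check satisfied.
    torch.zeros((1, 1))
    return results


ctx = FakeContext()
results = run(ctx)
```

Without the final `torch.zeros` call, `ctx.handles()` would be empty in this sketch, which corresponds to the NO_REQUESTS case the sentinel is meant to avoid.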
+### D3. Output location
+
+Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
+(by user request: artifacts next to the bench). The directory holds only generated
+PNG/CSV/JSON files (no `.py`/`__init__.py`), so the eager-import audit (ADR-0045
+First action) ignores it: `pkgutil.iter_modules` does not yield non-package
+subdirectories. Like the `docs/diagrams/` artifacts, the outputs are **committed** (so
+figures are viewable on the remote); re-running a bench regenerates them in place.
+
+### D4. Heavy GEMM sweep: fresh by default, reuse with `MILESTONE_FAST`
+
+`milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one shape has
+2048 tiles). `MILESTONE_FAST=1` reuses the committed
+`docs/diagrams/gemm_sweep.json` and only renders (seconds). This reverses ADR-0044
+D1/D2's "keep the heavy sweep as a manual/`slow` step": running the bench is itself the
+regeneration. The slow path is exercised by a `@pytest.mark.slow` bench test, and the
+fast path runs by default.
+
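A rough sketch of the `MILESTONE_FAST` switch, assuming the bench simply chooses between loading the committed JSON and running the sweep. `load_or_run` and the placeholder `run_sweep` body are hypothetical; only the environment variable name comes from the text.

```python
import json
import os
import tempfile


def run_sweep():
    """Placeholder for the expensive sweep; returns dummy records."""
    return [{"shape": "demo", "variant": "v0"}]


def load_or_run(json_path, environ=os.environ):
    """Reuse stored results when MILESTONE_FAST=1, otherwise sweep afresh."""
    if environ.get("MILESTONE_FAST") == "1":
        with open(json_path, encoding="utf-8") as fh:
            return json.load(fh), "reused"
    records = run_sweep()
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(records, fh)
    return records, "fresh"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "gemm_sweep.json")
    fresh = load_or_run(path, environ={})
    fast = load_or_run(path, environ={"MILESTONE_FAST": "1"})
```

Passing `environ` explicitly keeps the sketch testable without mutating the process environment.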
+### D5. Tests + scripts reuse the benches via thin re-export shims (single home preserved)
+
+The existing figure tests and the `scripts/gemm_sweep.py` entry point are kept and now
+reuse the bench modules:
+
+- `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
+  `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
+  `kernbench.benches.milestone_1h_gemm`.
+- `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config writer, sweep
+  constants, renderers, and disk aggregator from `kernbench.benches.milestone_1h_ccl`,
+  and keeps the **pytest-only** pieces local: the `pytest.param` matrices
+  (`CONFIGS` / `_sweep_params` / `_bk_params`) and the fixture-coupled
+  `_run_distributed` wrapper (`monkeypatch.chdir` + `_drive_distributed`).
+- `scripts/gemm_sweep.py` → thin wrapper over the bench's `run_sweep`.
+
+Tests are allowed to import bench modules (tests sit above production, ADR-0007). Doing so
+triggers the full-package eager audit, which already runs on every `kernbench`
+invocation. matplotlib stays a lazy import inside the renderers, so the audit's
+startup cost is unchanged.
+
+### D6. Flat module naming (no `benches/` subfolder)
+
+A `benches/` subpackage named `1H_milestone…` is impossible: a Python package name
+cannot start with a digit. The benches are therefore the flat modules
+`milestone_1h_gemm.py` / `milestone_1h_ccl.py`, and the bench names are
+`milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, starting with a letter per
+ADR-0045 D1).
+
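A quick way to see the constraint, using only the standard library: a name that appears in an `import` statement must be a valid identifier, which `str.isidentifier()` checks. The helper name below is illustrative.

```python
def is_importable_module_name(name: str) -> bool:
    """A name usable in an `import` statement must be a valid identifier."""
    return name.isidentifier()


checks = {
    "1H_milestone": is_importable_module_name("1H_milestone"),
    "milestone_1h_gemm": is_importable_module_name("milestone_1h_gemm"),
}
```

The digit-leading name fails the check, while the flat module name chosen in D6 passes.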
+## Consequences
+
+### Positive
+
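+- `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of a domain's
+  results + figures with one command, which is the milestone requirement.
+- A single source for the evaluation logic (the bench), reused by tests and scripts
+  through shims; no duplication.
+- The figure tests and `scripts/gemm_sweep.py` keep working unchanged.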
+
+### Negative / limitations
+
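+- Both bench files are large (the CCL one mixes the distributed driver, sweeps, and
+  matplotlib drawing). A "bench" that is mostly an evaluation harness is unusual; this
+  ADR justifies it.
+- Generated artifacts live inside the source tree (`src/kernbench/benches/`) by explicit
+  request and are committed (so figures are viewable on the remote); re-running a bench
+  regenerates them.
+- `milestone-1h-ccl` (and the default `milestone-1h-gemm`) take minutes:
+  acceptable for on-demand milestone artifacts, not for routine runs.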
+
+## Dependencies
+
+- **ADR-0007**: layer direction (why tests may import production code but benches may
+  not import tests).
+- **ADR-0043 / ADR-0044**: the allreduce / GEMM evaluation harnesses that this ADR
+  moves into benches.
+- **ADR-0045**: bench module contract; D2 here extends its D5 (single-device rule) with
+  the eval-bench pattern and relies on D4 (NO_REQUESTS) for the sentinel.
+- **ADR-0024**: the rank = SIP launcher that the allreduce sweep drives.
+
+## Open questions
+
+- Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from ADR-0033/0014
+  instead of copied? Unchanged by this ADR.
+- Should `build_overview_slides.py` consume the milestone output PNGs instead of drawing
+  the GEMM bars natively? Still open (ADR-0044 D6 / Negative).
@@ -0,0 +1,175 @@
+# ADR Index
+
+Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **47**.
+
+Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.
+
+## Design Principles
+
+- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — Verification strategy and Phase 1 test plan
+- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — Latency model: assumptions and known simplifications
+
+## High-level Architecture
+
+- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — Target system hierarchy and modeling scope  _(System hierarchy (Tray / SIP / CUBE / PE))_
+- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — Runtime API and simulation engine boundaries  _(Runtime API ↔ sim_engine boundaries)_
+- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NoC and memory data path  _(IOChiplet NOC and memory data path)_
+- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — Cube NoC and HBM connectivity  _(Cube NOC and HBM connectivity)_
+
+## Detailed Architecture
+
+One subsection per component file under `src/kernbench/components/builtin/`.
+
+### forwarding
+
+- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding component (forwarding_v1)
+
+### hbm_ctrl
+
+- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM controller internal design
+
+### io_cpu
+
+- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU component model
+
+### m_cpu
+
+- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU and M_CPU.DMA component model
+
+### pcie_ep
+
+- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model
+
+### pe_cpu
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
+
+### pe_dma
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
+- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
+
+### pe_fetch_store
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
+
+### pe_gemm
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
+
+### pe_ipcq
+
+- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
+
+### pe_math
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
+
+### pe_mmu
+
+- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — dual role as component + utility
+
+### pe_scheduler
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE pipeline execution model
+
+### pe_tcm
+
+- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — dual-channel BW serialization
+
+### sram
+
+- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC
+
+### tiling
+
+- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math pipeline plan builders
+
+## Implementation Decisions
+
+### Address Scheme
+
+- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51-bit physical address layout and decoding contract
+- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — Memory addressing — PA / VA / LA address model
+
+### Routing & Helper API
+
+- [ADR-0002](./ADR-0002-lat-routing-distance.md) — Routing distance, ordering, and detour rules
+- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`
+
+### Memory Semantics & Local-HBM Bandwidth
+
+- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — Memory semantics and local HBM bandwidth guarantee
+
+### Topology Compilation, Diagrams & Builder Algorithms
+
+- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — Diagram views and distance-based layout rules
+- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — Topology compilation, distance extraction, and automatic diagram generation
+- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms
+
+### Tensor Deployment and Allocation
+
+- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — Tensor deployment and allocation (host allocator, PA-first)
+
+### Kernel Execution and Host-Device Messaging
+
+- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — Kernel execution messaging and completion semantics
+- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU message schema (PA-first, PE-tagged)
+
+### CLI Surface and Semantics
+
+- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — Command-line interface and execution semantics
+
+### Component Port/Wire Fabric Model
+
+- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — Component port/wire model and fabric routing
+
+### Two-Pass Data Execution
+
+- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-pass data execution model (timing / data separation)
+
+### 2D Grid Program Identity
+
+- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D grid program_id semantics
+
+### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
+
+- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
+- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — sip/num_sips fields removed
+- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
+- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
+- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`
+
+### IPCQ Direction Addressing
+
+- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching
+
+### Intercube All-Reduce
+
+- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — Inter-cube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
+
+### Evaluation Harnesses
+
+- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce evaluation harness — `tests/sccl/`
+- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM evaluation harness — `scripts/gemm_sweep.py` + `tests/gemm/`
+- [ADR-0054](./ADR-0054-eval-milestone-benches.md) — Milestone evaluation benches — self-contained sweep + figure bench
+
+### Bench Module Contract
+
+- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring
+
+### Kernel-side tl.* API (TLContext)
+
+- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract
+
+### Memory Allocator Algorithms
+
+- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
+
+### Probe Subcommand
+
+- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
+
+### Sim-engine Op Log and Memory Store Schemas
+
+- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -7,6 +7,11 @@ Accepted
 Documents the `tests/sccl/` evaluation harness; verified against the
 implementation (constants, file set, and sweep dimensions cross-checked).

+**Amended by ADR-0054**: the driver core, sweeps, and renderers moved into
+the `milestone-1h-ccl` bench (single home); `tests/sccl/_allreduce_helpers.py`
+now re-exports from it (keeping the pytest-only param builders +
+`_run_distributed` wrapper local). The figure tests are unchanged.
+
 ## Context

 ADR-0032 defines the intercube all-reduce *algorithm*; ADR-0023/0024/0027
@@ -9,6 +9,12 @@ implementation (constants, tile sizes, figure set, and the script↔test
 split cross-checked). The D5/D6 caveats are recorded limitations, not
 inaccuracies.

+**Amended by ADR-0054**: the sweep + renderers moved into the
+`milestone-1h-gemm` bench (single home); `scripts/gemm_sweep.py` and
+`tests/gemm/` now re-export from it. D1/D2's "data generation stays a manual
+script / heavy work is opt-in" is superseded by the eval-bench pattern (one
+bench regenerates everything; `MILESTONE_FAST=1` reuses the committed JSON).
+
 ## Context

 ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generators) define the GEMM
@@ -0,0 +1,295 @@
+# ADR-0045: Bench Module Contract — registration, dispatch, and authoring
+
+## Status
+
+Accepted (2026-05-21).
+
+Unifies the `src/kernbench/benches/` registration mechanism (@bench), the
+CLI dispatch path (`kernbench run/list`), and the contract a new bench
+module must follow. ADR-0010 (CLI surface) specifies the `kernbench
+list/run` interface, but **how benches are registered and what signature
+they must follow** had no ADR-level coverage.
+
+**Extended by ADR-0054**: D5's single-config rule gains a third pattern —
+the *eval bench* (e.g. `milestone-1h-*`) drives many configs, builds its
+own per-config engines, and submits a sentinel tensor to satisfy D4.
+
+## First action
+
+When `kernbench.benches` is imported, `__init__.py` immediately calls
+`_eager_import_and_audit(__path__, __name__)`. Its first action is to
+enumerate every sibling module in the package directory via
+`pkgutil.iter_modules(__path__)` and **eagerly import** each one via
+`importlib.import_module(...)` — except modules matching either:
+
+- name `registry` (the infrastructure module itself), or
+- name starting with `_` (helper modules).
+
+At import time, each `@bench(name=..., description=...)` decorator inside
+the imported module runs, appending `(name, description, fn)` to
+`_PENDING` and adding `fn.__module__` to `_REGISTERED_MODULES`.
+
+Once imports finish, `_audit_modules(imported, _REGISTERED_MODULES)`
+runs; if any imported module did not invoke `@bench` at least once, it
+raises `RuntimeError("Bench module(s) missing @bench decorator: ...")`
+immediately. At this point indices are still unassigned — the first call
+to `list_all()` / `resolve(...)` triggers `_finalize()`, which sorts
+`_PENDING` alphabetically by name and assigns 1-based indices.
+
+In short, **the bench infrastructure's first act is "eagerly import
+every non-helper module in the package and audit that each one
+registered at least one bench"**.
+
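The selection and audit steps can be sketched as two pure functions. This is a simplification: it works on bare module names, whereas the text says `_REGISTERED_MODULES` holds `fn.__module__` values, and the function names here are illustrative.

```python
def select_bench_modules(names):
    """Drop the infrastructure module and `_`-prefixed helpers."""
    return [n for n in names if n != "registry" and not n.startswith("_")]


def audit_modules(imported, registered):
    """Raise if any imported module never registered a bench."""
    missing = sorted(set(imported) - set(registered))
    if missing:
        raise RuntimeError(
            "Bench module(s) missing @bench decorator: " + ", ".join(missing)
        )


found = ["registry", "_shared_helpers", "gemm_single_pe", "qkv_gemm"]
to_import = select_bench_modules(found)
audit_modules(to_import, {"gemm_single_pe", "qkv_gemm"})  # passes silently

try:
    audit_modules(to_import, {"gemm_single_pe"})
    audit_error = None
except RuntimeError as exc:
    audit_error = str(exc)
```

In the second call, `qkv_gemm` was imported but never registered, so the audit reports it by name.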
+## Context
+
+`src/kernbench/benches/` currently holds 8 bench modules (`ccl_allreduce`,
+`gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`,
+`qkv_gemm`, `qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows
+the same unified flow:
+
+```
+kernbench run --topology <T> --bench <N>
+   ↓
+cli/main.py::cmd_run
+   ↓  resolve_topology(T)  + resolve(N)  + resolve_device(device_arg)
+   ↓
+runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
+   ↓  engine_factory(topology, device) → GraphEngine
+   ↓  RuntimeContext(engine, target_device, correlation_id, spec)
+   ↓
+bench_fn(ctx)        ← invokes the bench's run(torch)
+   ↓  ctx.empty/zeros/from_numpy/launch/distributed.* etc. submit work
+   ↓
+ctx.wait_all()       ← drains any outstanding handles
+   ↓
+BenchResult(completion, correlation_id, trace, traces, engine)
+```
+
+ADR-0010 covers only the CLI surface (`run/list/probe/web`); ADR-0007
+covers only the runtime API ↔ sim_engine boundary. The question "what
+shape must a new bench file take?" had to be answered by grepping the
+codebase. As a result:
+
+- The @bench decorator contract (kebab-case name, non-empty description)
+  lived only in the source.
+- The bench function signature (`def run(torch)`) was a de-facto
+  convention enforced by the CLI dispatcher calling `spec.run`.
+- New bench authors learned the "helpers must use `_` prefix" rule only
+  after seeing the audit's RuntimeError.
+- The single-device convention (CLAUDE.md Part 2 CLI Semantics) and its
+  interaction with multi-SIP CCL benches was ambiguous for bench
+  authors.
+
+This ADR consolidates all of it in one place.
+
+## Decision
+
+### D1. @bench decorator contract
+
+```python
+from kernbench.benches.registry import bench
+
+@bench(name="my-bench", description="Short, complete-sentence description.")
+def run(torch):
+    ...
+```
+
+- `name`: kebab-case string matching `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`.
+  Lowercase letters, digits, and dashes only; underscores forbidden;
+  must start with a letter.
+- `description`: non-empty string (stripped length > 0). Displayed
+  verbatim by `kernbench list`.
+- The decorator **returns the function unchanged** — direct invocation
+  is fine. Its only side effect is appending to `_PENDING`.
+
+Violations of the first two rules raise `ValueError` at decoration time.
+Duplicate names are caught at `_finalize()` with
+`RuntimeError("duplicate bench name: ...")`.
+
+### D2. Module-file convention
+
+Every `src/kernbench/benches/<slug>.py` must be one of:
+
+- **A bench module**: at top-level import, `@bench(...)` runs at least
+  once to register at least one bench.
+- **A helper module**: the filename starts with `_` (e.g.,
+  `_shared_helpers.py`). The eager-import loop skips it.
+
+The audit (`_audit_modules`) rejects any non-helper that fails to call
+`@bench`. Intended consequence: dropping a new file into `benches/`
+automatically registers its benches, and helper modules are clearly
+flagged by their filename prefix alone.
+
+### D3. The bench function signature is `def run(torch)`
+
+The decorator does not enforce a function name, but **CLI dispatch calls
+`spec_entry.run`** (the decorated callable). The convention is therefore:
+
+- Function name: `run`. Other names work, but always use `run` for
+  readability and grep-ability.
+- Argument: a single positional `torch`. In practice this is a
+  `RuntimeContext` instance exposing PyTorch-style namespaces
+  (zeros/empty/launch/distributed/...) — see ADR-0024 D3.
+- Return value: any (`Any`). `run_bench` ignores it and tracks
+  completion via `ctx.handles()` / `engine.get_completion()`.
+
+The `torch` name imitates a PyTorch-compatible idiom; the actual PyTorch
+module is not passed in (aligned with ADR-0024's "rank = SIP" launcher
+convention).
+
+### D4. A bench must submit at least once
+
+If `ctx.handles()` is empty after the bench returns, `run_bench` reports a
+`BenchResult.completion` with `ok=False` and `error_code="NO_REQUESTS"`. So a
+meaningful bench must invoke at least one of:
+
+- Tensor-creation APIs: `torch.zeros(...)`, `torch.empty(...)` — these
+  internally submit `MmuMapMsg` and (for zeros) `MemoryWriteMsg`.
+- Kernel-launch API: `torch.launch(name, fn, *args)` — submits per-SIP
+  `KernelLaunchMsg`.
+- (Exception) Empty placeholder benches: e.g.,
+  `ipcq_allreduce.py`'s `print(...)`-only stub will receive a
+  NO_REQUESTS result. CI is expected to recognize and handle placeholder
+  benches specially.
+
+### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)
+
+CLAUDE.md Part 2 CLI Semantics' **"benchmarks MUST remain
+single-device"** rule is interpreted as follows:
+
+- **Standard bench (single-SIP use)**: define tensor placement with
+  `dp = DPPolicy(...)` and launch with `torch.launch(...)`. The SIP
+  index is chosen by `--device` (CLI's responsibility).
+- **CCL bench (multi-SIP use)**: as an exception, use
+  `torch.distributed.init_process_group(backend="ahbm")` plus
+  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` for the
+  rank = SIP pattern (ADR-0024 D3). `--device` is ignored (or treated
+  as `all`); each spawned worker calls `torch.ahbm.set_device(rank)` to
+  bind to its SIP.
+
+Multi-device patterns outside these two (e.g., one bench function
+launching across multiple SIPs in the same process) are forbidden by
+this ADR. Even with `--device all`, the bench runs once; multi-SIP use
+inside that single run must follow D5's second pattern.
+
+### D6. Name/index resolution (`resolve`)
+
+`resolve(identifier: str)` returns a BenchSpec via:
+
+1. If `identifier.isdigit()`: convert to int and find the spec where
+   `index ==` that value. If none, `ValueError("No bench with index
+   ...")`.
+2. If `identifier in _REGISTRY`: direct lookup.
+3. Otherwise: `ValueError("Unknown bench ...")`.
+
+Empty or whitespace-only identifiers raise `ValueError("bench
+identifier must be a non-empty string.")`.
+
+The CLI passes `--bench` directly to `resolve`, so users can use either
+`kernbench run --bench gemm-single-pe` or `kernbench run --bench 2`.
+
+### D7. Indices are not a stable API
+
+`_finalize()` sorts `_PENDING` alphabetically by name and assigns
+1-based indices. Adding a new bench can shift existing benches'
+indices. Therefore:
+
+- Human-interactive use: indices are fine.
+- Scripts / CI automation: always use the name.
+
+This caveat is documented in `registry.py`'s module docstring.
+
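The interaction between D6 and D7 can be sketched with plain dicts. `finalize` and `resolve` below are simplified stand-ins (the real registry stores `BenchSpec` objects), and the bench descriptions are dummies.

```python
def finalize(pending):
    """Sort by name and assign 1-based indices; reject duplicate names."""
    registry = {}
    for index, (name, description) in enumerate(sorted(pending), start=1):
        if name in registry:
            raise RuntimeError(f"duplicate bench name: {name}")
        registry[name] = {"index": index, "name": name, "description": description}
    return registry


def resolve(registry, identifier):
    """Look up a bench by 1-based index (digits) or by exact name."""
    if not identifier or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    if identifier.isdigit():
        for spec in registry.values():
            if spec["index"] == int(identifier):
                return spec
        raise ValueError(f"No bench with index {identifier}")
    if identifier in registry:
        return registry[identifier]
    raise ValueError(f"Unknown bench {identifier}")


before = finalize([("gemm-single-pe", "d"), ("qkv-gemm", "d")])
after = finalize([("gemm-single-pe", "d"), ("qkv-gemm", "d"), ("ccl-allreduce", "d")])
```

Adding `ccl-allreduce` moves `gemm-single-pe` from index 1 to index 2 in this sketch, which is why scripts should resolve by name.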
+### D8. Surface RuntimeContext exposes to benches
+
+A bench's `torch` parameter may legitimately use:
+
+- **Tensor creation**: `torch.empty(shape, dtype=..., dp=DPPolicy(...),
+  name=...)`, `torch.zeros(...)`, `torch.from_numpy(arr)`. All submit
+  host-side metadata plus device deployment (`MmuMapMsg` +
+  `MemoryWriteMsg`).
+- **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)` —
+  converts `(Tensor, int, float)` positional args to `TensorArg` /
+  `ScalarArg`, submits per-SIP `KernelLaunchMsg`, and drains.
+- **Synchronization**: `torch.wait(handle)`, `torch.wait_all()`
+  (`run_bench` calls the latter automatically).
+- **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`,
+  `torch.distributed.get_world_size()`,
+  `torch.distributed.all_reduce(t, op=...)` (ADR-0024/0027).
+- **Multi-process (rank = SIP)**:
+  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` (ADR-0024 D3 /
+  ADR-0027).
+- **Device binding**: `torch.ahbm.set_device(rank)` or
+  `torch.accelerator.set_device_index(rank)` (both point to the same
+  namespace).
+- **IPCQ install**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)`
+  (ADR-0023 D10).
+- **Spec lookup**: `torch.spec` — the dict produced by the topology
+  builder (system / cube_mesh / HBM parameters etc.). Use it so the
+  bench does not hardcode topology.yaml values.
+
+Benches must not access RuntimeContext private members (`_handles`,
+`_traces`, `_allocators`, etc.) directly. This aligns with ADR-0007's
+layer-boundary spirit: bench → runtime API → sim_engine flows in one
+direction.
+
+### D9. Environment-variable parameterization is allowed
+
+Benches may parameterize themselves via `os.environ.get(...)`, as
+`matmul_composite.py` does for `MATMUL_M`, `MATMUL_K`, `MATMUL_N`,
+`MATMUL_DTYPE`, `MATMUL_VARIANT`. Rationale:
+
+- The bench function signature is fixed by D3 to `def run(torch)`, so
+  positional/keyword arguments cannot carry parameters.
+- The env-var pattern is a natural hook for operational sweeps (e.g.,
+  `MATMUL_VARIANT`).
+- External drivers such as `scripts/gemm_sweep.py` (ADR-0044) consume
+  this hook (it sets `MATMUL_M/K/N/VARIANT` at
+  `scripts/gemm_sweep.py:115-118`).
+
+When environment variables alter bench behavior, the module docstring
+must list every variable used (`matmul_composite.py` is the canonical
+example).
+
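A small sketch of the env-var hook, assuming each variable is read with `os.environ.get` and a fallback. The default values shown are placeholders, not the defaults used by `matmul_composite.py`.

```python
import os


def read_matmul_params(environ=os.environ):
    """Read sweep parameters from the environment (placeholder defaults)."""
    return {
        "m": int(environ.get("MATMUL_M", "128")),
        "k": int(environ.get("MATMUL_K", "128")),
        "n": int(environ.get("MATMUL_N", "128")),
        "dtype": environ.get("MATMUL_DTYPE", "f16"),
        "variant": environ.get("MATMUL_VARIANT", "default"),
    }


params = read_matmul_params({"MATMUL_M": "256", "MATMUL_VARIANT": "tiled"})
```

An external driver would set these variables before invoking the bench; unset variables fall back to whatever defaults the bench documents in its module docstring.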
+## Alternatives Considered
+
+### A1. An explicit manifest file (YAML) listing benches
+
+Rejected. The `@bench` + audit pattern guarantees "drop in file → auto-
+register", concentrating cognitive cost in one place (the file itself).
+A separate manifest is prone to drift, and helper separation is already
+clear via the `_` prefix.
+
+### A2. Allowing the bench's entry-point name in the decorator (`@bench(name=..., entry="run_xxx")`)
+
+
+Rejected. Breaks the simplicity of dispatch (`spec.run` is a single
+callable). The `run` convention is sufficient; variants can register
+multiple `@bench`-decorated functions in the same module.
+
+### A3. A separate `@multi_device_bench` decorator for CCL
+
+Rejected. The two patterns named in D5 (single + ADR-0024 multi-SIP)
+cover all 8 current benches. A separate decorator would force dispatch
+to branch and add complexity; the multi-SIP intent is already obvious
+from the bench's `init_process_group(...)` call.
+
+### A4. Make indices a stable API (registration order or explicit `index=` argument)
+
+
+Rejected. D7's trade-off favors user-friendliness — alphabetically
+sorted 1-based indices read naturally in the `list` output. Scripts can
+use names.
+
+## Consequences
+
+- "How to add a bench" is consolidated in one ADR — new authors only
+  need to read D1-D3 and D8 without grepping source.
+- The `_`-prefixed helper-module pattern is legitimized at ADR level,
+  so future `benches/_*.py` shared helpers are free to be added.
+- The CLI's single-device convention and CCL's multi-SIP exception are
+  shown to be consistent (D5) — they are orthogonal.
+- The rationale for ADR-0044's GEMM eval harness using env-var hooks
+  (D9) is now ADR-pinned.
+- Indices are explicitly unstable (D7), so any CI code calling
+  `kernbench run --bench 3` is flagged for review after this ADR is
+  accepted.
@@ -0,0 +1,327 @@
+# ADR-0046: TLContext — Kernel-side `tl.*` API Contract
+
+## Status
+
+Accepted (2026-05-22).
+
+Documents the set of `tl.*` primitives exposed by
+`src/kernbench/triton_emu/`'s `TLContext`, their semantics, and the two
+execution-mode contracts (command-list / greenlet runner). ADR-0014/0020
+define the PE pipeline and the 2-pass execution model, but **the `tl.*`
+surface that bench kernel functions call** had no ADR-level coverage.
+
+## First action
+
+When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id,
+num_cubes, scratch_base, scratch_size)` is instantiated, the first action
+is to initialize six categories of state:
+
+- `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes` —
+  values that `tl.program_id` / `tl.num_programs` will return.
+- `self._dispatch_cycles` — cycle count emitted as `PeCpuOverheadCmd(cycles)`
+  at the start of every `tl.*` API call.
+- `self._runner` — `KernelRunner` instance (present → greenlet mode;
+  absent → command-list mode).
+- `self._commands: list[PeCommand] = []` — command-list accumulator
+  (command-list mode only).
+- `self._handle_counter = 0`, `self._completion_counter = 0` — counters
+  for generating TensorHandle / CompletionHandle ids.
+- `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0` —
+  PE-local scratch region (used for math/dot/composite output handle
+  addresses).
+
+In short, **TLContext's first act is "record where (sip/cube/pe) and at
+what scale (num_programs/num_cubes) this kernel instance runs, and pick
+its dispatch mode (runner present or not)"**. No SimPy event is created
+and no command is emitted at this moment.
+
+The runtime first action happens when the kernel function first calls a
+`tl.<api>()`. The standard entry for every `tl.*` API is:
+
+1. Call `self._emit_dispatch_overhead()` — if `dispatch_cycles > 0`,
+   immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
+2. Per-API processing (TensorHandle creation, command construction).
+3. `self._emit(cmd)` — in runner mode this `greenlet.switch()`es the cmd
+   to SimPy; in command-list mode it appends to `self._commands`.
+
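The three-step entry path can be sketched in command-list mode with a stripped-down context. `MiniTLContext` and the command field names are hypothetical; the real `TLContext` carries more state and also supports the runner path.

```python
from dataclasses import dataclass, field


@dataclass
class PeCpuOverheadCmd:
    cycles: int


@dataclass
class DmaReadCmd:
    ptr: int
    shape: tuple


@dataclass
class MiniTLContext:
    """Command-list-mode sketch: no runner, commands accumulate in a list."""

    dispatch_cycles: int = 0
    commands: list = field(default_factory=list)

    def _emit(self, cmd):
        self.commands.append(cmd)

    def _emit_dispatch_overhead(self):
        if self.dispatch_cycles > 0:
            self._emit(PeCpuOverheadCmd(self.dispatch_cycles))

    def load(self, ptr, shape):
        self._emit_dispatch_overhead()          # step 1: dispatch overhead
        cmd = DmaReadCmd(ptr=ptr, shape=shape)  # step 2: per-API processing
        self._emit(cmd)                         # step 3: emit the command
        return cmd


tl = MiniTLContext(dispatch_cycles=4)
tl.load(0x1000, (16, 16))
kinds = [type(c).__name__ for c in tl.commands]
```

With `dispatch_cycles=0` the overhead command is skipped and only the API's own command is recorded.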
+## Context
+
+The `tl.*` surface consists of `TLContext`'s methods, and the `tl`
+parameter received by a kernel function is one of these objects. The
+contract the user (bench author) sees:
+
+- Which primitives exist.
+- What data flow each primitive triggers (DMA / compute / IPCQ /
+  metadata-only).
+- How a TensorHandle's `space` and `addr` are decided.
+- The difference between command-list and greenlet modes.
+
+ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER,
+but how `tl.*` emits them is a code-only convention. ADR-0020 (2-pass
+data execution) mentions greenlet mode in D3 but does not pin down the
+signature difference (return-value handling) between the runner /
+non-runner paths. This ADR fills the gap.
+
+## Decision
+
+### D1. The `tl` parameter is a `TLContext` instance
+
+A bench kernel function has the signature:
+
+```python
+def _kernel(arg1, arg2, ..., tl, **kwargs):
+    ...
+```
+
+`tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name
+imitates real Triton's `triton.language` module; the actual Triton
+module is **not** passed in.
+
+The kernel is plain Python — no `yield` or `async`. `tl.*` calls produce
+SimPy events, but to the caller they appear synchronous because in
+greenlet mode the KernelRunner relays between SimPy and the kernel
+(ADR-0020 D3).
+
+### D2. Two execution modes — command-list / greenlet runner
+
+- **Command-list mode (`runner is None`)**: `tl.*` calls append a `PeCommand`
+  to `self._commands`. DMA / GEMM / Math consume no SimPy time and return
+  metadata-only TensorHandles (`data=None`). PE_SCHEDULER / sim_engine
+  later replays the command sequence in time.
+
+- **Greenlet runner mode (`runner is not None`)**: `tl.*` calls
+  `self._emit(cmd)` → `runner.switch_to_simpy(cmd)`, handing control to
+  the parent greenlet (SimPy). The parent distributes the cmd to
+  components, consumes SimPy time, and (for DMA reads) returns real numpy
+  data. The kernel receives the result and continues to the next line
+  (the data-aware execution model from ADR-0020 D3).
+
+The choice of mode is decided by whether a KernelRunner is injected into
+the TLContext. The `tl.*` methods themselves are mode-blind — they go
+through `_emit()` uniformly.
+
+### D3. Primitive categories
+
+#### D3.1. Reference (no DMA, metadata only)
+
+- `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: create a handle
+  referencing HBM data without issuing DMA. Used when the scheduler
+  streams the data per-tile (e.g., the b operand of a composite GEMM).
+
+#### D3.2. Data movement (blocking, DMA engine)
+
+- `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → handle.
+  Emits `DmaReadCmd`. In greenlet mode the returned handle's `.data`
+  carries real numpy data; in command-list mode it is a placeholder.
+  The handle has `space="hbm"`, `pinned=True`.
+- `tl.store(ptr, handle) -> None`: TCM → HBM. Emits `DmaWriteCmd`. In
+  greenlet mode, when `handle.data` is present, `_store.write("hbm",
+  ptr, data)` runs first (visibility = issue time, ADR-0020 D3).
+
+#### D3.3. GEMM / compute (blocking)
+
+- `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must live in
+  TCM; shapes `(M,K) × (K,N) → (M,N)`. Emits `GemmCmd`; the output
+  handle is allocated from PE-local scratch via
+  `_make_compute_out(shape, dtype)`.
+- `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
+  acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking
+  tiled pipeline. Emits `CompositeCmd`. `epilogue` is a list of dicts,
+  each with `"op"` plus op-specific fields and an optional `"scope"`
+  (k_tile / output_tile). Unknown ops or missing fields raise
+  ValueError immediately. The returned CompletionHandle synchronizes
+  via `tl.wait(h)`.
+
+#### D3.4. Math: unary (blocking)
+
+- `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
+  `tl.cos(x)`, `tl.sin(x)` — each emits `MathCmd(op=<name>,
+  inputs=(x,), out=)`. `out` is scratch-allocated with the same
+  shape/dtype as `x`.
+
+#### D3.5. Math: binary (blocking)
+
+- `tl.maximum(a, b)`, `tl.minimum(a, b)` — `_binary_math`.
+- `tl.fma(a, b, c)` — `a*b + c`. Three inputs.
+- `tl.clamp(x, min, max)` — `MathCmd(op="clamp", inputs=(x, min, max))`.
+- `tl.where(cond, a, b)` — `MathCmd(op="where", inputs=(cond, a, b))`.
+- `tl.softmax(x, axis=-1)` — a single `MathCmd(op="softmax")`, so timing is
+  accounted for as one dispatch. Phase 2 DataExecutor expands it to the
+  canonical (x-max → exp → sum → div) sequence.
+
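The canonical expansion named for `tl.softmax` can be illustrated on a plain Python list. This is a numerical sketch only, not the DataExecutor implementation.

```python
import math


def softmax_expanded(row):
    """Expand softmax into the four canonical steps on one row."""
    row_max = max(row)
    shifted = [v - row_max for v in row]   # x - max
    exps = [math.exp(v) for v in shifted]  # exp
    total = sum(exps)                      # sum
    return [e / total for e in exps]       # div


probs = softmax_expanded([1.0, 2.0, 3.0])
```

Subtracting the row maximum first keeps every exponent at or below zero, which avoids overflow without changing the result.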
+#### D3.6. Reduction (blocking)
+
+- `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)` — return an
+  output handle with the axis size collapsed to 1. Emit
+  `MathCmd(op=<name>, inputs=(x,), out=, axis=axis)`.
+
+#### D3.7. Index / scalar (PE_CPU, no engine)
+
+- `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE
+  index), `axis==1` → cube_id (ADR-0022).
+- `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per
+  cube), `axis==1` → num_cubes.
+- `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range
+  in TCM. No command emitted.
+- `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape,
+  value, dtype="f16") -> TensorHandle`: TCM placeholder. No command
+  emitted.
+
+#### D3.8. Scalar helpers (no command, no engine)
+
+- `TLContext.cdiv(a, b) -> int` (static): ceiling division
+  `-(-a // b)`. Mirrors real Triton's `tl.cdiv`.
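As a small illustration of the `-(-a // b)` idiom, the sketch below computes how many tiles cover a buffer. The standalone `cdiv` function and the tile sizes are illustrative assumptions; the idiom is only shown for a positive divisor.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for a positive divisor, written as -(-a // b)."""
    return -(-a // b)


# Number of 128-element tiles needed to cover 1000 elements.
n_tiles = cdiv(1000, 128)
```

Floor division rounds toward negative infinity, so negating before and after turns it into a round-up without any floating-point step.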
+
+#### D3.9. Metadata-only (no compute, no DMA)
+
+- `tl.trans(x) -> TensorHandle`: a new handle with the last two dims
+  swapped. Shares `addr` and `data`; no command emitted.
+
+#### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
+
+- `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
+  dtype="f16", space="tcm") -> None`: blocking send. Accepts either
+  handle form or raw-address form. Emits `IpcqSendCmd`. The handle's
+  `.data` snapshot rides along on the command — avoiding the race
+  where a later inbound IPCQ overwrites the slot before the outbound
+  PE_DMA reads it.
+- `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
+  dst_space=None) -> TensorHandle`: blocking recv. Providing both
+  `dst_addr` and `dst_space` enters "copy_to_dst" mode; otherwise
+  "return_slot" mode. In greenlet mode the handle's `.data` carries
+  the real data.
+- `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
+  **DIAGNOSTIC ONLY**. Has the same blocking-arrival semantics as
+  `tl.recv` but skips the slot-read latency charge (slot-IO + PE↔bank
+  fabric drain). Used in the pe2pe overview plot for an apples-to-apples
+  comparison against `tl.store`. Production kernels MUST NOT use it —
+  the diagnostic flag is isolated in its own command branch
+  (`consume=False`) so it cannot be accidentally enabled.
+- `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
+  recv. Returns a `RecvFuture`; resolved later by `tl.wait(future)`.
+
+#### D3.11. Composite + control
+
+- `tl.composite(...)`: see D3.3.
+- `tl.wait(handle=None)`: wait on a `CompletionHandle` (composite), a
+  `RecvFuture` (async recv), or `None` (all pending composites).
+- `tl.cycles(n)`: declare a scalar PE_CPU overhead. Emits
+  `PeCpuOverheadCmd(cycles=n)`.
+
+### D4. TensorHandle arithmetic operators — thread-local TLContext
+
+At module load, `tl_context.py::_enable_tensor_ops()` runs and patches
+`TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__`. Each
+operator calls `_binary_math` on the active TLContext stored in a
+module-level thread-local `_ctx`.
+
+So inside a kernel, `c = a + b` is equivalent to emitting
+`MathCmd(op="add", inputs=(a, b), out=)` and returning a new
+TensorHandle.
+
+Active-TLContext management:
+
+- `TLContext._set_active(ctx)`: set the active ctx for the current
+  thread/greenlet.
+- `TLContext._get_active()`: read it (RuntimeError if unset).
+- `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets active
+  on entry, runs the kernel, restores `None` on exit.
+
+`KernelRunner` re-asserts `_set_active(tl)` inside its `_switch_kernel`
+just before resuming the kernel, so a sibling PE runner that overwrote
+the thread-local context is correctly recovered.
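The operator-patching pattern described in D4 can be sketched in isolation. Everything here (`Handle`, `Ctx`, `set_active`, `get_active`, the tuple-based command log) is a hypothetical stand-in, assuming only what D4 states: operators delegate to a binary-math method on a thread-local active context, and a run helper sets and then clears that context.

```python
import threading
from dataclasses import dataclass

_ctx = threading.local()  # holds the active context for the current thread


@dataclass
class Handle:
    shape: tuple


class Ctx:
    def __init__(self):
        self.commands = []  # recorded (op, a, b, out) tuples

    def binary_math(self, op, a, b):
        out = Handle(a.shape)
        self.commands.append((op, a, b, out))
        return out


def set_active(ctx):
    _ctx.active = ctx


def get_active():
    ctx = getattr(_ctx, "active", None)
    if ctx is None:
        raise RuntimeError("no active context")
    return ctx


def enable_ops():
    # Patch the operators once; each call resolves the context lazily.
    Handle.__add__ = lambda a, b: get_active().binary_math("add", a, b)
    Handle.__mul__ = lambda a, b: get_active().binary_math("mul", a, b)


enable_ops()


def run_kernel(kernel_fn, ctx, *args):
    set_active(ctx)
    try:
        return kernel_fn(*args)
    finally:
        set_active(None)  # restore "unset" on exit


ctx = Ctx()
result = run_kernel(lambda a, b: a + b, ctx, Handle((4, 4)), Handle((4, 4)))
```

Because the operators look up the context at call time rather than at patch time, re-asserting the active context before resuming a kernel is enough to recover from a sibling runner having overwritten it.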
+
+### D5. Scratch allocator — compute output handles
+
+Ops that produce a result — `tl.dot`, `tl.exp`, `tl.add` (via
+TensorHandle `__add__`), etc. — call `_make_compute_out(shape, dtype)`
+to obtain a 16-byte-aligned scratch address. The address is published
+with `space="tcm"`, so the handle can later be the source of a
+`tl.send` / `tl.store`.
+
+When `_scratch_base == 0` (e.g., command-list mode), the address is 0
+and the handle cannot be a send/store source (in that case, only
+`tl.load`-returned handles are valid sources).
+
+When the cursor exceeds `_scratch_size` (default 1 MiB), a
+RuntimeError is raised. The cursor must reset between kernel
+invocations (current code naturally satisfies this: KernelRunner
+creates a fresh TLContext each time).
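A bump-style cursor is one way to picture the scratch behavior in D5. The sketch below is an assumption-laden illustration, not the real `_make_compute_out`: `ScratchCursor` is a hypothetical name, and rounding the cursor up to 16 bytes is an assumed way to obtain the stated alignment (it also assumes the base itself is 16-byte aligned).

```python
class ScratchCursor:
    ALIGN = 16  # bytes

    def __init__(self, base: int, size: int = 1 << 20):  # 1 MiB default
        self.base = base
        self.size = size
        self.cursor = 0

    def alloc(self, nbytes: int) -> int:
        if self.base == 0:
            return 0  # no scratch region: address 0, not usable as a source
        # Round the cursor up to the next 16-byte boundary.
        start = (self.cursor + self.ALIGN - 1) // self.ALIGN * self.ALIGN
        if start + nbytes > self.size:
            raise RuntimeError("scratch overflow")
        self.cursor = start + nbytes
        return self.base + start


scratch = ScratchCursor(base=0x1000, size=64)
first = scratch.alloc(10)
second = scratch.alloc(10)
```

A cursor like this never frees individual outputs, which is why a fresh context per kernel invocation is what keeps it from running out.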
+
+### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
+
+Every non-metadata `tl.*` call starts with `_emit_dispatch_overhead()`,
+which — when `dispatch_cycles > 0` — emits
+`PeCpuOverheadCmd(dispatch_cycles)`. This models the cycles PE_CPU
+spends dispatching the command.
+
+Defaults:
+
+- `TLContext.__init__`'s `dispatch_cycles` parameter default: `1` cycle.
+- TLContext built by `KernelRunner`: `0` cycles (greenlet mode handles
+  cycle accounting differently — aligned with ADR-0020 D3 intent).
+
+### D7. Kernel registry (`triton_emu/registry.py`)
+
+A separate `_kernels: dict[str, Callable]` holds the name → function
+mapping:
+
+- `register_kernel(name, fn)`: ValueError on duplicate.
+- `get_kernel(name)`: KeyError if missing.
+- `clear_registry()`: test-only.
+
+`RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites
+`_kernels[kernel_name] = kernel_fn` on every call (last-call-wins,
+idempotent) — consistent with ADR-0045 D8's `launch` behavior.
+
+PE_CPU looks up `KernelRef.name` in the registry and runs the function
+through KernelRunner.
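The two registration behaviors in D7 (strict `register_kernel` versus last-call-wins on launch) can be contrasted in a few lines. This is a minimal sketch under the semantics stated above; `launch_register` is a hypothetical helper standing in for the overwrite done inside `RuntimeContext.launch`.

```python
from typing import Callable, Dict

_kernels: Dict[str, Callable] = {}


def register_kernel(name: str, fn: Callable) -> None:
    if name in _kernels:
        raise ValueError(f"kernel {name!r} already registered")
    _kernels[name] = fn


def get_kernel(name: str) -> Callable:
    return _kernels[name]  # KeyError if missing


def clear_registry() -> None:
    _kernels.clear()  # test-only


def launch_register(name: str, fn: Callable) -> None:
    """Last-call-wins overwrite, as described for the launch path."""
    _kernels[name] = fn
```

The strict path protects explicit registrations from accidental clobbering, while the launch path stays idempotent for benches that relaunch the same kernel name.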
+
+## Alternatives Considered
+
+### A1. Fold `tl.*` into ADR-0014 / ADR-0020
+
+Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption
+of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing /
+Phase 2 data). The `tl.*` surface is what the kernel author touches; a
+dedicated ADR improves findability and onboarding.
+
+### A2. Deprecate command-list mode
+
+Rejected (currently). Simple unit tests and kernel verification benefit
+from the lighter command-list path — it exposes a PeCommand sequence
+inspector without requiring greenlet machinery. When greenlet-mode
+semantics (real data, Phase 2) are needed, D2 explicitly selects them.
+
+### A3. Remove TensorHandle arithmetic operators
+
+Rejected. They mimic real Triton kernel ergonomics (e.g., `c = a + b`),
+and the thread-local active-ctx pattern works cleanly. The explicit
+function-form (`tl.add(a, b)`) is also exposed in D3.5, so the
+operators are syntactic sugar.
+
+### A4. Expand softmax into the explicit sequence (max → exp → sum → div)
+
+Partially adopted. `tl.softmax` is a single `MathCmd(op="softmax")` for
+timing accounting (D3.5), but Phase 2 DataExecutor expands it to the
+canonical sequence for real-data computation. The timing model stays
+atomic while the data model is expanded; the split is intentional.
+
+## Consequences
+
+- Every `tl.*` primitive a bench author meets is classified and defined
+  in a single ADR. Paired with ADR-0045 D8's host-side surface
+  (`torch.empty` etc.), the inside-kernel and outside-kernel authoring
+  guides are now complete.
+- The command-list / greenlet difference is pinned in D2, so any new
+  `tl.*` primitive that follows the `_emit()` pattern auto-supports
+  both modes.
+- The thread-local active-ctx pattern (D4) is justified at ADR level,
+  clarifying who owns the reset responsibility when multiple PE
+  runners share a thread (KernelRunner.run's contract restores active
+  inside `_switch_kernel`).
+- `tl.recv_no_consume`'s diagnostic isolation (D3.10) is hardened in
+  ADR form — accidental production use is blocked by a separate
+  command branch.
+- The registry (D7) gets its own D-section, formalizing the
+  name-collision and dynamic-re-registration semantics.
@@ -0,0 +1,259 @@
+# ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
+
+## Status
+
+Accepted (2026-05-22).
+
+Pins down what `runtime_api/distributed.py`'s `AhbmCCLBackend` +
+`DistributedContext` actually install — i.e., the entry point
+`torch.distributed.init_process_group(backend="ahbm")` — and how
+`all_reduce`/`barrier`/`get_rank` etc. are implemented. ADR-0023 D11
+mentions the "torch.distributed compatibility" intent, but **the backend
+itself** had no ADR-level coverage.
+
+## First action
+
+`RuntimeContext.__post_init__` automatically constructs a
+`DistributedContext()` and attaches it to `self.distributed`. The first
+action at that moment:
+
+1. `self._backend: AhbmCCLBackend | None = None` — uninitialized.
+2. `self._rank_by_greenlet: dict = {}` — greenlet-local rank registry
+   (ADR-0024 D2).
+3. The caller (RuntimeContext) sets `dc._ctx_ref = self` so subsequent
+   `init_process_group` can reach `ctx.engine` / `ctx.spec` / `ctx.launch`.
+
+In short, **DistributedContext's first act is "attach to RuntimeContext
+with a back-reference and leave the backend slot empty"**. Actual
+backend installation (IPCQ install, world_size derivation, algorithm
+module import) happens only when user code calls
+`torch.distributed.init_process_group(backend="ahbm")`.
+
+At that moment, `init_process_group`'s first action is:
+
+1. If `backend != "ahbm"`, raise `ValueError("Unsupported backend ...")`
+   immediately.
+2. If `getattr(self, "_ctx_ref", None)` is None, raise
+   `RuntimeError("DistributedContext not bound to a RuntimeContext")`.
+3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)` — inside this
+   constructor, ccl.yaml is loaded, the algorithm module is imported,
+   world_size is derived, SFR is configured, and IPCQ is installed.
+4. `self._backend._dist_ctx = self` — the backend gets a back-reference
+   so it can read `_rank_by_greenlet`.
+
+## Context
+
+The `AhbmCCLBackend` exists so that PyTorch DDP collective calls
+(`init_process_group`, `all_reduce`, etc.) work unchanged and bench code
+reads identically to a real DDP training script (in line with
+ADR-0024 + ADR-0027's launcher model).
+
+The backend's responsibilities:
+
+- At `init_process_group` time, install the **IPCQ neighbor table once**
+  (analogous to NCCL communicator creation).
+- For each `all_reduce(tensor, op="sum")`, dispatch the configured
+  algorithm's kernel function via `ctx.launch(...)`.
+- Answer `get_world_size` / `get_rank` consistently from the
+  greenlet-local rank registry plus ccl.yaml/topology.
+
+ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) touch
+parts of this, but **the backend's own responsibility scope and decision
+order** are not pinned anywhere. This ADR fills that gap.
+
+## Decision
+
+### D1. The backend is created only at `init_process_group(backend="ahbm")` time
+
+`DistributedContext` starts with `_backend = None`. The backend object
+does not exist until the user calls
+`dist.init_process_group(backend="ahbm")`. Any other API
+(`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) called
+while `_backend` is None raises
+`RuntimeError("Default process group has not been initialized...")` via
+the `_ensure_initialized` helper.
+
+`backend != "ahbm"` raises `ValueError` immediately. Other backend names
+(`nccl`, `gloo`, etc.) are not recognized.
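The lazy-creation rule in D1 can be sketched as follows. `DistributedContextSketch` is a hypothetical stand-in, the backend is replaced by a placeholder object, and the error strings are abbreviated; only the control flow is meant to match the description.

```python
class DistributedContextSketch:
    def __init__(self):
        self._backend = None  # no backend until init_process_group

    def init_process_group(self, backend="ahbm"):
        if backend != "ahbm":
            raise ValueError(f"Unsupported backend {backend!r}")
        self._backend = object()  # placeholder for the real backend

    def _ensure_initialized(self):
        if self._backend is None:
            raise RuntimeError(
                "Default process group has not been initialized"
            )
        return self._backend

    def get_backend(self):
        self._ensure_initialized()
        return "ahbm"


dc = DistributedContextSketch()
```

Routing every public call through one guard keeps the "not initialized" error message uniform across the API surface.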
+
+### D2. world_size resolution priority — algorithm > defaults > topology
+
+`AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):
+
+1. If `ccl.yaml`'s algorithm entry has `world_size`, use it.
+2. Else if `defaults.world_size` is set, use it.
+3. Else fall back to `spec.system.sips.count` (the topology's SIP count).
+
+The default interpretation is **rank = SIP** (ADR-0024). Cube/PE-level
+parallelism is expressed inside each rank via DPPolicy and does not
+affect world_size. An explicit `ccl.yaml` override is preserved for the
+legacy "rank = flat PE index" test path.
+
+User arguments to `init_process_group(world_size=..., rank=...)` are
+**accepted but ignored** (same as real PyTorch's `RANK` / `WORLD_SIZE`
+env vars).
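The priority order in D2 reduces to a three-way fallback. The sketch below assumes the merged config is available as plain dicts; `resolve_world_size` and its parameter names are illustrative, not the signature of `AhbmCCLBackend._resolve_world_size`.

```python
def resolve_world_size(algo_cfg: dict, defaults: dict, sip_count: int) -> int:
    """algorithm entry > defaults > topology SIP count."""
    if algo_cfg.get("world_size") is not None:
        return algo_cfg["world_size"]
    if defaults.get("world_size") is not None:
        return defaults["world_size"]
    return sip_count


# Hypothetical configs: the algorithm entry wins when present.
from_algo = resolve_world_size({"world_size": 8}, {"world_size": 4}, 2)
from_defaults = resolve_world_size({}, {"world_size": 4}, 2)
from_topology = resolve_world_size({}, {}, 2)
```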
+
+### D3. `init_process_group` performs five installation steps
+
+Inside `AhbmCCLBackend.__init__`, in order:
+
+1. **Load ccl.yaml**: `kernbench.ccl.install.load_ccl_config()` →
+   `resolve_algorithm_config(_cfg_all)` produces the merged config for
+   `defaults.algorithm` (or the user-specified algorithm).
+2. **Import algorithm module**:
+   `importlib.import_module(self._merged["module"])`. The module must
+   expose a `kernel` function, a `kernel_args(world_size, n_elem,
+   cube_w, cube_h)` helper, and optionally a `TOPO_NAME_TO_KIND` map.
+3. **Resolve world_size** (D2).
+4. **Collect topology metadata** from `spec`: `n_sips`, `sip_topo`
+   (`ring_1d` default), `cube_w`/`cube_h`, `sips.w`/`sips.h`. When the
+   SIP topology is not `ring_1d`, derive `_sip_topo_w/h` from explicit
+   `w`/`h` or via square-root (require `w*h == n_sips`). Mismatch raises
+   `ValueError`.
+5. **Install SFR + IPCQ**:
+   `kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine,
+   spec, self._merged)`. This pushes IPCQ neighbor tables to every
+   SIP/cube's pe0 (one-time setup analogous to NCCL communicator
+   creation).
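The width/height derivation in step 4 can be sketched as below. This is a hedged illustration: `derive_sip_grid` is a hypothetical name, and treating "either dimension missing" as the trigger for the square-root path is an assumption about behavior the text does not spell out.

```python
import math


def derive_sip_grid(n_sips, w=None, h=None):
    """Return (w, h) for a non-ring SIP topology, requiring w * h == n_sips."""
    if w is None or h is None:
        side = math.isqrt(n_sips)  # integer square root
        w, h = side, side
    if w * h != n_sips:
        raise ValueError(f"w*h = {w * h} does not match n_sips = {n_sips}")
    return w, h


square = derive_sip_grid(16)
explicit = derive_sip_grid(8, w=4, h=2)
```

With no explicit dimensions, a non-square SIP count such as 8 fails the `w * h == n_sips` check, which is the mismatch case that raises `ValueError`.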
+
+If the order changes (e.g., SFR runs before the algorithm module
+loads), partial initialization can result. D3 is therefore treated as
+an atomic 5-step block: on failure the backend remains uninstalled.
+
+### D4. Greenlet-local rank binding (ADR-0024 D2)
+
+`DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps spawned
+worker greenlets to their ranks. When the bench launcher (e.g.,
+`torch.multiprocessing.spawn`) spawns a worker, it registers via
+`dc._bind_rank(g, rank)`.
+
+`get_rank()` looks up the greenlet returned by `getcurrent()`. Unbound
+greenlets fall back to rank 0, keeping single-driver / test compatibility.
+
+The backend reads the current greenlet's rank from
+`_dist_ctx._rank_by_greenlet` during `all_reduce` (D5).
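The binding scheme in D4 is a dictionary keyed by the current execution unit, with 0 as the fallback. To stay dependency-free, the sketch below swaps greenlets for threads: `threading.current_thread` stands in for `getcurrent()`, and `RankRegistry` is a hypothetical name. It illustrates the lookup pattern only, not kernbench's launcher.

```python
import threading


class RankRegistry:
    def __init__(self, get_current=threading.current_thread):
        self._get_current = get_current  # stand-in for greenlet.getcurrent
        self._rank_by_task = {}

    def bind_rank(self, task, rank):
        self._rank_by_task[task] = rank

    def get_rank(self):
        # Unregistered callers fall back to rank 0.
        return self._rank_by_task.get(self._get_current(), 0)


registry = RankRegistry()
seen = []


def worker():
    seen.append(registry.get_rank())


t = threading.Thread(target=worker)
registry.bind_rank(t, 3)  # the launcher binds the rank before the worker runs
t.start()
t.join()
main_rank = registry.get_rank()  # the unbound main thread falls back to 0
```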
+
+### D5. `all_reduce(tensor, op="sum")` behavior
+
+Validation:
+
+- `op != "sum"` → `NotImplementedError`. Current kernels only
+  implement add reduction.
+- `tensor._handle is None` → `RuntimeError("not deployed")`.
+- `tensor._handle.shards` empty → `RuntimeError("no shards")`.
+
+Preparation:
+
+- `n_elem = shards[0].nbytes // tensor.itemsize` — element count of a
+  single shard.
+- `kernel_fn = self._algo_module.kernel` — the algorithm module's entry
+  function (imported in D3).
+- Decide effective cube dims: if the first SIP has just 1 cube, use
+  `(1, 1)`; otherwise use the topology's `cube_w`/`cube_h`. This
+  naturally absorbs TP runs that use only a subset of cubes.
+- `kernel_args = self._algo_module.kernel_args(world_size, n_elem,
+  cube_w, cube_h)` — the algorithm decides which arguments to pass to
+  its kernel.
+
+Dispatch:
+
+- Resolve the current greenlet's rank via
+  `_rank_by_greenlet.get(g, 0)`.
+- Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w,
+  sip_topo_h)`.
+- `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor,
+  *kernel_args, *extra_args, _defer_wait=True)` — `_defer_wait=True`
+  delegates collective drain to the main scheduler (ADR-0027 D0.4).
+
+Drain:
+
+- If the parent greenlet is alive (multi-greenlet mode), enqueue
+  `_pending_collective_handles` and switch to parent. The main
+  scheduler drains after all ranks have launched.
+- If single-driver mode, drain inline:
+  `for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
+
+### D6. `barrier()` is a no-op (single-driver model)
+
+kernbench runs all ranks as greenlets inside a single Python process,
+so no cross-process synchronization is needed. `barrier()` is callable
+but does no synchronization. Kept for real-PyTorch API compatibility so
+callers don't get `NotImplementedError`.
+
+If multi-process kernbench (SimPy event loop per process) is introduced
+in the future, D6 needs a superseding ADR.
+
+### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`
+
+- `get_rank()` (D4): the current greenlet's bound rank; unregistered → 0.
+- `get_world_size()` (D2): the world_size resolved by the backend in D3.
+- `get_backend()`: always the literal string `"ahbm"`. Calling it before
+  the backend exists triggers `_ensure_initialized`'s RuntimeError.
+
+Differences vs. real PyTorch:
+
+- Real PyTorch `get_rank()` is a process-global value; here it is
+  greenlet-local. Inside a spawned worker → the worker's rank; in the
+  main thread → 0. Bench authors should expect meaningful ranks only
+  inside worker functions.
+
+### D8. Supported API surface (final)
+
+`DistributedContext` exposes:
+
+- `init_process_group(backend="ahbm", world_size=None, rank=None,
+  **kwargs)`
+- `is_initialized() -> bool`
+- `get_world_size() -> int`
+- `get_rank() -> int`
+- `get_backend() -> str`
+- `all_reduce(tensor, op="sum") -> None`
+- `barrier() -> None`
+- (internal) `_bind_rank(g, rank)`
+
+Other PyTorch distributed APIs (`broadcast`, `reduce`, `all_gather`,
+`gather`, `scatter`, point-to-point `send/recv`, etc.) are **not
+implemented**. Kernel-level expression is available via
+`tl.send`/`tl.recv` (ADR-0046 D3.10), but the `dist.*` surface does not
+expose them. If additional collectives are needed, add an (algorithm
+module, `DistributedContext` method) pair and extend D8.
+
+## Alternatives Considered
+
+### A1. Create the backend in `RuntimeContext.__init__`
+
+Rejected. If `ccl.yaml` is missing or the algorithm module can't be
+imported, RuntimeContext construction would fail even when the bench
+does not use distributed features. Lazy creation at call time (D1) is
+the right semantics.
+
+### A2. Always derive world_size from topology (no override)
+
+Rejected. ADR-0024 D1's "explicit override" path is used by legacy
+tests. Diagnostic scenarios that define PE-level ranks within a single
+SIP also need this escape hatch.
+
+### A3. Silent fallback for unsupported `op`
+
+Rejected. If the user intends `op="prod"` / `"max"` / `"avg"` and `sum`
+silently runs instead, wrong results become very hard to trace. An
+explicit `NotImplementedError` is safer.
+
+### A4. Implement `barrier` as a SimPy event
+
+Rejected (currently). With single-driver semantics there is no
+cross-process synchronization to express, so a no-op is meaningfully
+correct. A fake-barrier SimPy event would add code complexity for no
+semantic gain. Revisit when multi-process kernbench arrives.
+
+## Consequences
+
+- The 5-step installation (D3) for
+  `torch.distributed.init_process_group(backend="ahbm")` is locked in,
+  making clear where future collective algorithms must hook.
+- The priority order in D2 (algorithm > defaults > topology) makes the
+  blast radius of ccl.yaml changes easy to assess.
+- The no-op `barrier` (D6) is recorded so multi-process kernbench, if
+  introduced, must explicitly supersede this ADR.
+- D8's list of unsupported APIs explicitly grounds the rejection
+  message when users call, e.g., `dist.broadcast(...)`.
@@ -0,0 +1,278 @@
+# ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
+
+## Status
+
+Accepted (2026-05-22).
+
+Pins down the free-list algorithm, page alignment, and coalescing rules
+used by `policy/address/allocator.py`'s `_FreeList` / `PEMemAllocator`
+and `va_allocator.py`'s `VirtualAllocator`. ADR-0001 (PhysAddr layout)
+and ADR-0011 (PA/VA/LA models) define the address schemes; the
+**allocation algorithms** had no ADR-level coverage.
+
+## First action
+
+### `_FreeList(capacity)`
+
+On construction: `self._capacity = capacity`, `self._used = 0`,
+`self._free = [(0, capacity)]`. The first act is **establishing the
+entire region as one free block** — the tuple `(offset=0,
+size=capacity)` is the sole entry in the free list.
+
+### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
+
+On construction, builds two `_FreeList`s:
+
+- `self._hbm = _FreeList(cfg.hbm_slice_bytes)` — the size of this PE's
+  HBM slice (`hbm_bytes_per_cube // hbm_slices_per_cube`).
+- `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)` — equals
+  `tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler
+  reservation is pre-deducted).
+
+So PEMemAllocator's first act is **constructing single-free-block
+HBM-slice and TCM regions for this PE**.
+
+### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
+
+On construction: `self._va_base = va_base`, `self._va_size = va_size`,
+`self._page_size = page_size`, `self._used = 0`, `self._free =
+[(va_base, va_size)]`. The first act is **establishing one free block of
+`va_size` bytes starting at `va_base` and stashing page_size**.
+
+## Context
+
+`runtime_api/context.py::_ensure_allocators` builds the allocator set
+in these stages:
+
+1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`,
+   per-target_device SIP range, etc. from `spec`.
+2. Pack everything into a frozen `AddressConfig`.
+3. For every combination in the target SIP range × cubes × PEs,
+   construct one `PEMemAllocator(sip, cube, pe, cfg)` instance.
+4. Construct one `VirtualAllocator` with `va_base=0x1_0000_0000`,
+   `va_size` = 64 GiB, and `page_size=pe_mmu.page_size`.
+
+Allocator responsibilities:
+
+- **PEMemAllocator**: PA-space allocation in the PE-local HBM slice /
+  TCM (including PhysAddr encoding).
+- **VirtualAllocator**: device-wide VA allocation, page-aligned.
+  `RuntimeContext._create_tensor` then pushes VA → PA mappings to
+  components via `MmuMapMsg`.
+
+These algorithms are:
+
+- **First-fit**, kept simple.
+- The free-block list is **sorted by start offset**.
+- On `free()`, **adjacent blocks coalesce**.
+
+The rationale was not documented anywhere, so when someone asked "why
+not best-fit?", "why not a buddy allocator?", or "why does a
+partial-overlap free pass silently?", there was no anchor to answer
+from. This ADR provides it.
+
+## Decision
+
+### D1. `_FreeList` — offset-keyed first-fit + coalescing
+
+`policy/address/allocator.py::_FreeList`:
+
+- Internal representation: `list[tuple[int, int]] = [(start_offset,
+  size), ...]` — sorted by start offset.
+- `alloc(nbytes)`:
+  1. Iterate the free list from the front (first-fit).
+  2. Take from the first block with `size >= nbytes`.
+  3. Exact match → drop the block; otherwise shrink it to `(start +
+     nbytes, size - nbytes)`.
+  4. `_used += nbytes`; return the taken `start`.
+  5. If no block fits, `AllocationError("overflow ... largest free
+     block ...")`.
+- `free(offset, nbytes)`:
+  1. `_used -= nbytes`.
+  2. `bisect_left(self._free, (offset,))` finds the insertion index.
+  3. If adjacent to the previous block (`prev_start + prev_size ==
+     offset`), merge.
+  4. If adjacent to the next block (`offset + nbytes == next_start`),
+     merge.
+  5. Insert the coalesced range at the right sorted position.
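The `alloc` / `free` steps above can be sketched as a small self-contained class. This is an illustrative reconstruction from the description, not the code in `policy/address/allocator.py`; `FreeList` and its attribute names are hypothetical. Like the described `free`, it performs no validation of the freed range.

```python
import bisect


class AllocationError(Exception):
    pass


class FreeList:
    def __init__(self, capacity):
        self.used = 0
        self.free_blocks = [(0, capacity)]  # (start, size), sorted by start

    def alloc(self, nbytes):
        for i, (start, size) in enumerate(self.free_blocks):  # first-fit
            if size >= nbytes:
                if size == nbytes:
                    del self.free_blocks[i]       # exact match: drop block
                else:
                    self.free_blocks[i] = (start + nbytes, size - nbytes)
                self.used += nbytes
                return start
        largest = max((s for _, s in self.free_blocks), default=0)
        raise AllocationError(
            f"overflow: need {nbytes}, largest free block {largest}"
        )

    def free(self, offset, nbytes):
        self.used -= nbytes
        i = bisect.bisect_left(self.free_blocks, (offset,))
        start, size = offset, nbytes
        if i > 0:                                  # merge with previous block
            prev_start, prev_size = self.free_blocks[i - 1]
            if prev_start + prev_size == start:
                start, size = prev_start, prev_size + size
                i -= 1
                del self.free_blocks[i]
        if i < len(self.free_blocks):              # merge with next block
            next_start, next_size = self.free_blocks[i]
            if start + size == next_start:
                size += next_size
                del self.free_blocks[i]
        self.free_blocks.insert(i, (start, size))


fl = FreeList(100)
a, b, c = fl.alloc(30), fl.alloc(30), fl.alloc(40)
fl.free(a, 30)
fl.free(c, 40)
fl.free(b, 30)  # the middle block bridges both neighbors
```

Freeing the middle block last exercises both merge branches at once, collapsing the list back to a single full-capacity block.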
+
+This algorithm is weaker than best-fit / buddy on fragmentation, but
+the simulator's workload (mostly stack-like deploy/free) tolerates it.
+If the workload shape changes, D1 is a supersession candidate.
+
+### D2. Partial-overlap free is **not** validated
+
+`_FreeList.free(offset, nbytes)` trusts the caller to pass the exact
+`(offset, nbytes)`. It does **not** verify:
+
+- That the range was actually allocated.
+- That the range does not overlap another allocated region.
+
+Reason: in a simulator context, callers always store the return value
+of `alloc()` and pass it back to `free()` — there is no external user
+input. Adding a safety check would cost O(N) per free and impact
+simulation wall-clock.
+
+If this trust model breaks (e.g., a code path lets two tensors point
+at the same PA), this ADR must be revisited.
+
+### D3. `PEMemAllocator` — two channels for HBM/TCM
+
+`PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList`s:
+
+- `_hbm`: size `cfg.hbm_slice_bytes`.
+- `_tcm`: size `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
+  tcm_scheduler_reserved_bytes`).
+
+`alloc_hbm(nbytes) -> PhysAddr`:
+
+- `_hbm.alloc(nbytes)` → offset.
+- `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id,
+  pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`.
+- Failure raises `AllocationError("HBM overflow ...")`.
+
+`free_hbm(pa, nbytes)`:
+
+- Recover PE-local offset via `pa.hbm_offset - pe_id *
+  cfg.hbm_slice_bytes`.
+- `_hbm.free(offset, nbytes)`.
+
+`alloc_tcm(nbytes) -> PhysAddr`: similar; uses `PhysAddr.pe_tcm_addr`.
+
+`free_tcm(pa, nbytes)`: uses `pa.sub_offset` directly (TCM's PE-local
+offset equals its sub_offset).
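The offset recovery in `free_hbm` implies that a PE's slice starts at `pe_id * hbm_slice_bytes` within the cube's HBM offset space. The round trip below is a sketch under that assumption; the function names and the 1 GiB slice size are hypothetical, and real `PhysAddr` encoding carries more fields than shown.

```python
def to_cube_hbm_offset(pe_id, local_offset, slice_bytes):
    """PE-local offset -> offset within the cube's HBM (assumed layout)."""
    return pe_id * slice_bytes + local_offset


def to_pe_local_offset(pe_id, hbm_offset, slice_bytes):
    """Inverse mapping, as used when freeing."""
    return hbm_offset - pe_id * slice_bytes


SLICE = 1 << 30  # hypothetical 1 GiB slice
hbm_offset = to_cube_hbm_offset(pe_id=3, local_offset=0x2000, slice_bytes=SLICE)
recovered = to_pe_local_offset(3, hbm_offset, SLICE)
```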
+
+The allocator does not see the scheduler-reserved TCM region
+(`cfg.tcm_scheduler_reserved_bytes`) — it is pre-subtracted from the
+`_tcm` capacity. This is consistent with ADR-0014's PE_SCHEDULER
+internal-buffer reservation.
+
+### D4. `VirtualAllocator` — page-aligned first-fit + coalescing
+
+`policy/address/va_allocator.py::VirtualAllocator`:
+
+- Internal representation: same sorted `list[tuple[int, int]]` as
+  `_FreeList`. Initially `[(va_base, va_size)]`.
+- `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
+- `alloc(nbytes) -> int`:
+  1. `aligned = _align_up(nbytes)`.
+  2. First-fit a block with `size >= aligned`.
+  3. Take `aligned` from the block's front; remove if exact.
+  4. `_used += aligned`. Return the block's `start` (which is page-
+     aligned).
+  5. Failure → `VaAllocationError`.
+- `free(va, nbytes)`: free `_align_up(nbytes)` worth. Coalesces with
+  the same algorithm as `_FreeList`.
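The page rounding in `_align_up` is worth seeing with numbers, since it decides how much VA each allocation really consumes. A minimal sketch, with `align_up` written as a free function for illustration:

```python
def align_up(nbytes: int, page_size: int) -> int:
    """Round nbytes up to a whole number of pages."""
    return -(-nbytes // page_size) * page_size


PAGE_4K = 4096
PAGE_2M = 2 * 1024 * 1024

small_4k = align_up(5000, PAGE_4K)      # spans two 4 KiB pages
small_2m = align_up(5000, PAGE_2M)      # still costs one full 2 MiB page
exact_2m = align_up(PAGE_2M, PAGE_2M)   # already aligned: unchanged
```

Because `free` rounds with the same function, the caller can pass back the original unaligned `nbytes` and the allocator still releases the full page-aligned span.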
+
+`page_size` has different defaults in two places:
+
+- `VirtualAllocator.__init__`'s parameter default: `2 MiB`. Direct-call
+  tests receive this.
+- `RuntimeContext._ensure_allocators` when constructing the instance:
+  `pe_mmu.attrs.get("page_size", 4096)` — uses
+  `topology.yaml`'s `pe_mmu.attrs.page_size` if set, else falls back
+  to `4 KiB`.
+
+The two defaults differ on purpose: `VirtualAllocator`'s standalone
+default (`2 MiB`) aligns with ADR-0039's PE_MMU stopgap default for
+direct-test ergonomics; the context fallback (`4 KiB`) is the safe
+minimum when `topology.yaml` doesn't specify a page size. The
+production path is always the latter (via `_ensure_allocators`), and
+when `topology.yaml` sets `page_size`, that value flows consistently
+into both the MMU and the VA allocator.
+
+If consistency breaks (e.g., VirtualAllocator instantiated with a
+page_size different from PE_MMU's), MMU `map()` falls into the
+sub-page region mode (ADR-0039 D3).
+
+VA range defaults: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64
+GiB`. These are hardcoded in `_ensure_allocators` and have no
+semantic meaning in ADR-0011's VA model — they simply reserve enough
+device-wide space without colliding with host code.
+
+### D5. Lifecycle of allocator instances
+
+- `RuntimeContext._ensure_allocators` is lazy — called on the first
+  `_create_tensor`.
+- The allocator dict (`self._allocators`) lives for the
+  RuntimeContext's lifetime. A second deploy in the same process
+  does not construct new objects.
+- `RuntimeContext.cleanup()` walks living tensors and calls
+  `_free_tensor()`, which issues MMU unmaps + `va_allocator.free` +
+  `pemem_allocator.free_hbm` — restoring the free lists. A subsequent
+  RuntimeContext starts fresh.
+
+This per-RuntimeContext isolation guarantees deterministic deploy →
+cleanup → deploy sequences within a single process.
+
+### D6. Allocator failure raises (no silent OOM)
+
+`_FreeList.alloc` and `VirtualAllocator.alloc` raise `AllocationError`
+and `VaAllocationError` respectively when no block fits. The message
+includes the required size and the largest available block, to
+distinguish fragmentation from true OOM.
+
+A silent fallback (e.g., allocating only as much as the largest free
+block) is strictly forbidden — a partially-allocated tensor reaching
+SimPy would cause routing / DMA to see incorrect PAs and silently
+corrupt simulation results.
+
+### D7. One allocator per address space
+
+Physical address spaces are separated by PhysAddr sub-units (ADR-0001
+D2.3); each sub-unit gets its own allocator instance:
+
+- HBM slice → `PEMemAllocator._hbm`.
+- PE TCM → `PEMemAllocator._tcm`.
+- (Currently unused) M_CPU local memory, CUBE SRAM → would need their
+  own allocators. Today these are handled as IPCQ-only slots (ADR-0023
+  D9.7) and do not share PA space, so no free-list exists for them.
+
+When a cube-level SRAM allocator is needed,
+`_FreeList(cfg.sram_bytes_per_cube)` is added per-cube
+(`cfg.sram_bytes_per_cube` is already defined in `AddressConfig` —
+the data model is ready).
+
+## Alternatives Considered
+
+### A1. Best-fit / buddy allocator
+
+Rejected (currently). The workload's alloc/free pattern is stack-like
+(deploy order ≈ free order), so first-fit + coalescing controls
+fragmentation well enough. If long-running fragmentation appears in LLM
+kernel sweeps, a buddy-allocator ADR will replace D1.
+
+### A2. Add partial-overlap free validation
+
+Rejected. D2's trust model plus the O(N) per-free cost makes this
+unattractive. A debug mode (e.g., `KERNBENCH_DEBUG` env var) that
+enables the check could be added later.
+
+### A3. A unified allocator for VA and PA
+
+Rejected. VA space (64 GiB device-wide) and PA space (per-slice ~6
+GiB) have different semantic dimensions — VA is the kernel's view, PA
+is the device sub-unit's view. ADR-0011's VA model (MMU maps between
+the two) calls for separated allocators.
+
+### A4. Multi-tier page sizes (large pages + small pages)
+
+Rejected (currently). A single page size (2 MiB) matches LLM kernel
+tensor sizes (a few MiB to GiB); smaller mappings are absorbed by
+ADR-0039 D3's sub-page region mode. Multi-tier paging would require
+extending the MMU model itself — a separate ADR candidate.
+
+## Consequences
+
+- The allocator algorithm is pinned at ADR level (D1, D3, D4), so any
+  future simulation scenario hitting fragmentation has a clear "we're
+  using first-fit + coalescing" anchor to inspect.
+- D2's trust model is explicit, so any future code path that exposes
+  alloc/free to direct user input will trigger this ADR's supersession
+  early.
+- D7's one-allocator-per-sub-unit mapping is on record, so when M_CPU
+  or SRAM need their own free-list, the addition point is obvious.
+- D4 captures the page_size dual-default and its production path
+  (`_ensure_allocators` always wins), letting future `topology.yaml`
+  `page_size` changes be assessed against ADR-0039's stopgap
+  interaction quickly.
@@ -0,0 +1,247 @@
+# ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
+
+## Status
+
+Accepted (2026-05-22).
+
+Pins down the traffic-pattern catalog, formula-vs-actual comparison, and
+invariant checks (monotonicity, D2H ≥ H2D, etc.) exposed by
+`probes/probe.py::run_probe(...)`. ADR-0010 (CLI surface) enumerates the
+`kernbench probe` subcommand, but **what probe actually measures** and
+**which invariants it judges PASS/FAIL** had no ADR-level coverage.
+
+## First action
+
+`run_probe(topology_path, case_filter=None)` performs four startup steps:
+
+1. `Path(topology_path).expanduser().resolve()` → absolute path.
+2. `load_topology(path)` → `TopologyGraph` (graph + spec).
+3. `_build_edge_map(graph)` → a `{(src, dst): Edge}` lookup table.
+4. Instantiate `AddressResolver(graph)` + `PathRouter(graph)`.
+
+Then it sets `nbytes = 32768` (= 32 KiB, the summary-table reference
+size) and `show_all = (case_filter is None or case_filter == "all")`.
+
+In short, **probe's first act is "load the topology once and prepare
+edge map / resolver / router, plus pin 32 KiB as the standard measurement
+size"**. After that, the H2D → D2H → PE DMA categories execute in
+separate `GraphEngine` instances (no cross-talk between cases).
+
+## Context
+
+`kernbench probe` was introduced as a verification tool for these
+purposes:
+
+- **Manual ground truth**: when a real-simulation result (`kernbench run
+  --bench ...`) shows abnormal latency, derive the answer for a simple
+  traffic pattern in isolation and compare.
+- **Formula vs actual**: check whether the analytical model
+  (wire latency + overhead + drain) matches the simulator's
+  `total_ns`. A mismatch points to which simplifying assumption in
+  ADR-0033 is missing.
+- **Monotonicity check**: latency should grow monotonically with hop
+  count.
+- **Utilization sweep**: a BW-utilization table across data sizes
+  (4 KiB to 1 MiB).
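The monotonicity check can be pictured as a comparison over hop-sorted latencies. This sketch is an assumption about the shape of the check, not the code in `probes/probe.py`: the function name is hypothetical, the comparison is taken to be non-decreasing, and the latency values are made-up sample numbers.

```python
def is_monotonic_in_hops(latency_ns_by_hops: dict) -> bool:
    """True if latency never decreases as the hop count grows."""
    hops = sorted(latency_ns_by_hops)
    return all(
        latency_ns_by_hops[lo] <= latency_ns_by_hops[hi]
        for lo, hi in zip(hops, hops[1:])
    )


# Made-up sample latencies keyed by hop count.
ok = is_monotonic_in_hops({1: 410.0, 2: 455.0, 3: 500.0, 4: 545.0})
bad = is_monotonic_in_hops({1: 410.0, 2: 455.0, 3: 430.0, 4: 545.0})
```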
+
+Without an ADR for this tool:
+
+- Adding a new traffic-pattern category (e.g., MCpuDma, IPCQ) is hard
+  because the table format / measurement units of existing categories
+  aren't documented at the ADR level.
+- The basis for the monotonicity check (hop count? cube distance? wire
+  length?) is ambiguous.
+- The reference size 32 KiB and the sweep `[4 KiB, 16 KiB, 64 KiB, 256
+  KiB, 1 MiB]` are only discoverable by reading source.
+
+## Decision
+
+### D1. Three case categories — H2D / D2H / PE DMA
+
+Each category has a distinct data path in the topology and gets its own
+summary table + sweep table + route-detail block.
+
+- **H2D (Host → Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
+  dst_pe=0, pattern="zero")` flows along `pcie_ep → io_cpu → m_cpu →
+  hbm_ctrl`. The cube index varies the hop count:
+  - h2d-1hop: cube=0, hops=1
+  - h2d-2hop: cube=4, hops=2
+  - h2d-3hop: cube=8, hops=3
+  - h2d-4hop: cube=12, hops=4
+- **D2H (Device → Host Read)**: `MemoryReadMsg(src_sip=0, src_cube,
+  src_pe=0)`. Total latency = forward command path + reverse data path.
+  Uses the same four hop cases as H2D.
+- **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe,
+  dst_pa)`. Five cases cover varying cube/PE positions:
+  - pe-local-hbm: same cube, same PE
+  - pe-same-half-hbm: same cube, different PE (PE 1)
+  - pe-cross-half-hbm: same cube, far PE (PE 4)
+  - pe-cross-cube-hbm-best: adjacent cube (cube 1)
+  - pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
+
+The cube indices 0/4/8/12 (H2D and D2H) and 1/15 (PE DMA cross-cube
+cases) are meaningful only for a 4 × 4 cube mesh
+(`sip.cube_mesh.w=4, h=4`); if the mesh size changes, they must be
+updated in lockstep.
+
+### D2. Standard measurement size — `nbytes = 32768` (32 KiB)
+
+Every case in the summary table runs once with `nbytes=32768`. 32 KiB
+was chosen because:
+
+- DMA overhead and BW drain are balanced — neither dominates.
+- It compares cleanly against the one-shot transfer size of several
+  sub-units (TCM, register file).
+
+Per-size utilization variations are shown in a separate sweep table
+(D3).
+
+### D3. Utilization sweep — `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
+
+`SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`,
+`SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]`. Per size:
+
+```
+drain   = nbytes / bottleneck_bw
+total   = overhead + wire + drain
+eff_bw  = nbytes / total
+util%   = eff_bw / bottleneck_bw × 100
+```
+
+When `bn_bw` is `None` or `<= 0`, the column shows 0.0 %. The intent is
+to show in one table how small transfers become overhead-bound and
+large transfers become drain-bound as hop count rises.
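+
+A minimal sketch of the per-size computation above, assuming bandwidth
+is expressed in GB/s with 1 GB/s = 1 byte/ns (so `nbytes / bn_bw` is
+already in ns). The `sweep_row` helper and the sample overhead, wire,
+and bandwidth numbers are hypothetical and are not taken from
+`probes/probe.py`:
+
+```python
+SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]
+SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]
+
+
+def sweep_row(nbytes, overhead_ns, wire_ns, bn_bw_gbs):
+    """Return (total_ns, util_pct) for one transfer size.
+
+    Assumption: 1 GB/s == 1 byte/ns, so nbytes / bn_bw_gbs is in ns.
+    """
+    if bn_bw_gbs is None or bn_bw_gbs <= 0:
+        # No usable bottleneck bandwidth: the utilization column shows 0.0 %.
+        return (overhead_ns + wire_ns, 0.0)
+    drain_ns = nbytes / bn_bw_gbs
+    total_ns = overhead_ns + wire_ns + drain_ns
+    eff_bw = nbytes / total_ns
+    return (total_ns, eff_bw / bn_bw_gbs * 100.0)
+
+
+if __name__ == "__main__":
+    for label, size in zip(SWEEP_LABELS, SWEEP_SIZES):
+        total, util = sweep_row(size, overhead_ns=200.0, wire_ns=5.0, bn_bw_gbs=64.0)
+        print(f"{label:>6}  total={total:10.1f} ns  util={util:5.1f}%")
+```
+
+With fixed overhead and wire terms, `util%` equals `drain / total`, so
+it rises with `nbytes` as the drain term takes a growing share.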
+
+### D4. Measured columns — actual / formula / breakdown
+
+Per-case columns:
+
+- `Actual` (total_ns): the SimPy run's `trace["total_ns"]`.
+- `Ovhd`: sum of `node.attrs["overhead_ns"]` along the path (formula).
+- `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
+- `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
+- `Ovhd%` / `Drain%`: each portion as a percentage of Actual. Wire is
+  usually too small to display.
+- `Eff.BW`: `nbytes / total_ns` (measured BW).
+- `BN.BW`: bottleneck bandwidth (formula). The minimum edge BW along
+  the path. Missing edge BW shows "-".
+- `Util%`: `Eff.BW / BN.BW × 100`. 100 % means the single-stream BW
+  upper bound is reached.
+
+A large gap between the formula sum (`wire + ovhd + drain`) and Actual
+signals a factor the simplified model misses (a place to inspect
+ADR-0033's assumptions).
+
+### D5. Automatic invariant checks — PASS/FAIL
+
+The following invariants are reported with `[v] PASS` / `[x] FAIL`:
+
+- **H2D / D2H monotonic increase**: as hop count rises, actual latency
+  must strictly increase. `all(lats[i] < lats[i+1] for ...)`.
+- **D2H ≥ H2D**: for the same hop index, D2H ≥ H2D (D2H has both
+  forward command and reverse data legs). `all(d2h[i].total >=
+  h2d[i].total)`.
+- **PE DMA best < worst**: cross-cube best (adjacent) latency must be
+  less than cross-cube worst (diagonal).
+- **PE DMA local vs remote**: prints the local BN BW vs remote BN BW
+  side-by-side (informational, not PASS/FAIL).
+
+When a check fails, a single clear line surfaces the regression for
+human review.
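+
+The three PASS/FAIL invariants can be sketched as below. This is an
+illustration only: `check_invariants` and its argument layout (lists of
+`total_ns` ordered by hop count) are hypothetical, not the actual
+`run_probe` internals.
+
+```python
+def strictly_increasing(lats):
+    """True if every latency is smaller than the one after it."""
+    return all(a < b for a, b in zip(lats, lats[1:]))
+
+
+def check_invariants(h2d, d2h, pe_best, pe_worst):
+    """Return one '[v] PASS' / '[x] FAIL' line per invariant.
+
+    h2d, d2h: total_ns per hop case, in increasing hop order.
+    pe_best, pe_worst: total_ns of the two cross-cube PE DMA cases.
+    """
+    results = {
+        "H2D monotonic increase": strictly_increasing(h2d),
+        "D2H monotonic increase": strictly_increasing(d2h),
+        "D2H >= H2D": all(d >= h for d, h in zip(d2h, h2d)),
+        "PE DMA best < worst": pe_best < pe_worst,
+    }
+    return [
+        f"{'[v] PASS' if ok else '[x] FAIL'} {name}"
+        for name, ok in results.items()
+    ]
+
+
+if __name__ == "__main__":
+    for line in check_invariants([100, 150, 200, 250], [180, 240, 300, 360], 90, 400):
+        print(line)
+```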
+
+### D6. Route detail — per-hop timestamp trace
+
+After the summary and sweep tables, each case's path and cumulative
+per-hop timestamps (`_hop_timestamps`) appear in a separate section:
+
+- H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
+  (`m_cpu → hbm_ctrl`) + per-hop trace.
+- D2H: forward (cmd, no data) and reverse (data) traces shown
+  separately.
+- PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
+
+Each hop's timestamp is cumulative `wire_ns + overhead_ns`. The
+terminal hop's annotation appends `drain:Xns`. Bottleneck edges are
+marked `<BN:XXGB/s>` so they are visually identifiable.
+
+### D7. Semantics of the `case_filter` argument
+
+- `None` or `"all"`: run all cases (default).
+- Other strings: run only the case whose name matches exactly. Example:
+  `kernbench probe --case h2d-2hop`.
+
+Within a category, cases with `name != case_filter` are skipped; if
+only one data point remains, the category's monotonicity / D2H ≥ H2D
+comparisons are naturally skipped.
+
+The CLI parser's `--case` default is `"all"`, so omitting it runs
+everything.
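+
+As a sketch of these semantics (the `select_cases` helper is
+hypothetical; the case names are the H2D names from D1):
+
+```python
+def select_cases(names, case_filter=None):
+    """Return the case names that run for a given case_filter value."""
+    show_all = case_filter is None or case_filter == "all"
+    if show_all:
+        return list(names)
+    # Exact-match only: no prefix or substring matching.
+    return [name for name in names if name == case_filter]
+
+
+H2D_CASES = ["h2d-1hop", "h2d-2hop", "h2d-3hop", "h2d-4hop"]
+
+if __name__ == "__main__":
+    print(select_cases(H2D_CASES))               # all four cases
+    print(select_cases(H2D_CASES, "h2d-2hop"))   # a single case
+```
+
+With a single surviving case, a pairwise comparison over the category
+has nothing to compare, which is why the checks drop out.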
+
+### D8. Fresh GraphEngine per case
+
+Each of the 4 H2D, 4 D2H, and 5 PE DMA cases runs in **its own
+GraphEngine** (`engine = GraphEngine(graph)`). Reasons:
+
+- Isolate accumulated state (op_log, completion tracking, allocators)
+  so cases do not cross-talk.
+- Guarantee one case's traffic does not perturb another case's BW
+  measurement.
+
+This isolation lets probe results be interpreted as **single-flow**
+per-case latency. Multi-flow contention measurement is handled by
+separate tooling (e.g., the `pe2pe_overview` plot or ADR-0033's
+multi-flow merging model).
+
+### D9. Output-format stability
+
+probe's stdout is meant for humans; precise column widths, separators,
+and whitespace are **not** a machine-readable contract. Automated tools
+that wish to parse probe output should use a separate JSON-output mode
+(not yet implemented).
+
+The `[v]` / `[x]` prefix on PASS/FAIL lines is a stable CI grep anchor.
+
+## Alternatives Considered
+
+### A1. Register probe as another bench (`@bench(name="probe")`)
+
+Rejected. probe is a verification tool, not a bench: multi-engine
+execution for sweeps/analysis and PASS/FAIL invariant output are both
+essential, and neither fits ADR-0045's "single device + single
+RuntimeContext" bench model.
+
+### A2. Exit code 1 on monotonicity violation
+
+Rejected (currently). probe is positioned as a human inspection tool —
+PASS/FAIL is printed and exit is 0. A wrapper can `grep "\[x\]"` to
+decide. A future `--strict` flag could opt into non-zero exits.
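+
+A wrapper of that kind could look like the sketch below. It assumes
+only what D9 states (FAIL lines carry the `[x]` prefix); the
+`strict_exit_code` name is hypothetical and the function works on
+already-captured stdout text rather than invoking the CLI.
+
+```python
+def strict_exit_code(probe_stdout: str) -> int:
+    """Return 1 if any line of probe output starts with '[x]', else 0."""
+    for line in probe_stdout.splitlines():
+        if line.lstrip().startswith("[x]"):
+            return 1
+    return 0
+
+
+if __name__ == "__main__":
+    sample = "[v] PASS H2D monotonic\n[x] FAIL D2H >= H2D\n"
+    raise SystemExit(strict_exit_code(sample))
+```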
+
+### A3. Externalize the case catalog to YAML
+
+Rejected (currently). The 13 cases (4 H2D + 4 D2H + 5 PE DMA) are
+hardcoded and their semantics are tightly bound to the mesh
+topology. Moving the meaning of the hardcoded indices (4, 8, 12 / 1,
+4, 15) into YAML would require separate documentation and lose
+cohesion. Externalize only when case additions become frequent.
+
+### A4. Add multi-flow contention measurement
+
+Rejected (out of probe scope). D8's single-flow isolation is probe's
+core intent. Multi-flow contention belongs in a different area of the
+ADR-0033 latency model — either a separate tool or a new case
+category.
+
+## Consequences
+
+- probe's case catalog (D1) and measurement units (D2/D3) are pinned at
+  ADR level, so new traffic categories know which table format to
+  follow.
+- The semantics of the formula-vs-actual columns (D4) are locked in, so
+  questions like "why is Drain% 5 % or 70 %?" can quickly be linked to
+  ADR-0033 assumption checks.
+- Automatic invariant checks (D5) are pinned, so latency-model changes
+  immediately catch monotonicity / D2H ≥ H2D regressions.
+- D8's case-isolation is explicit, so probe results are safe to read as
+  single-flow measurements. If multi-flow is needed, a separate tool
+  track is clearly required.
+- A2's strict-mode flag is recorded as a follow-up so CI integration
+  has a minimal change path when requested.
@@ -0,0 +1,322 @@
+# ADR-0050: CCL Algorithm Module Contract — `ccl/algorithms/*.py`
+
+## Status
+
+Accepted (2026-05-22).
+
+Pins down the interface, kernel signature, and addition workflow that a
+module under `src/kernbench/ccl/algorithms/` must satisfy in order to be
+used as a collective algorithm by the AHBM CCL backend (ADR-0047).
+ADR-0047 D3 states only that "the algorithm module must expose `kernel`,
+`kernel_args`, optionally `TOPO_NAME_TO_KIND`"; **the contract an
+algorithm-module author needs to follow** has had no ADR-level coverage.
+This ADR pairs with ADR-0045's bench-module contract.
+
+## First action
+
+An algorithm module is imported at two moments:
+
+1. **AHBM backend entry**: when user code calls
+   `dist.init_process_group(backend="ahbm")`,
+   `AhbmCCLBackend.__init__` runs
+   `self._algo_module = importlib.import_module(self._merged["module"])`.
+   At module level, the following occur first:
+   - Topology-kind integer constants like `SIP_TOPO_RING/TORUS/MESH`
+     are bound in the module namespace.
+   - The `TOPO_NAME_TO_KIND` dict is bound; the backend reads it via
+     `getattr(self._algo_module, "TOPO_NAME_TO_KIND", None)`.
+   - The `kernel_args` function is defined for the caller.
+   - The actual algorithm function (e.g.,
+     `allreduce_intercube_multidevice`) is defined.
+   - At the bottom of the module, `kernel = allreduce_intercube_multidevice`
+     publishes the alias.
+
+2. **ccl.yaml install stage**:
+   `kernbench.ccl.install.install_ipcq` imports the same algorithm
+   module while pushing the IPCQ neighbor table.
+
+In short, **the algorithm module's first act is "publish topology-kind
+constants, the `TOPO_NAME_TO_KIND` dict, the `kernel_args` function, and
+the `kernel` alias into the module namespace"** — all as import-time
+side effects, no separate initialization call.
+
+## Context
+
+`AhbmCCLBackend` (ADR-0047), at process-group creation, dynamically
+imports a module path obtained from `ccl.yaml`'s `defaults.algorithm` (or
+a user-specified algorithm). The backend expects four things from the
+module:
+
+- `kernel`: the collective's entry function.
+- `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: a tuple
+  packing the kernel's positional arguments.
+- `TOPO_NAME_TO_KIND` (optional): a dict mapping `topology.yaml`'s
+  `sips.topology` string (e.g., `"ring_1d"`, `"torus_2d"`,
+  `"mesh_2d_no_wrap"`) to the integer kind constants.
+- (Indirectly) IPCQ neighbor-table install:
+  `configure_sfr_intercube_multisip` reads
+  the module's `TOPO_NAME_TO_KIND` plus cube dimensions to decide the
+  SFR.
+
+The current corpus has one algorithm module:
+`lrab_hierarchical_allreduce.py` (248 lines). The name expands to
+"**l**eft-**r**ight **a**lternating **b**roadcast hierarchical allreduce".
+When future modules like `ring_allreduce`, `tree_allreduce`, or
+`broadcast` are added, they must follow this contract for the backend's
+dispatch path to keep working.
+
+Without an ADR-level contract:
+
+- A new algorithm author has to infer the signature from ADR-0047 D3's
+  one-liner.
+- The kernel-function argument order (especially `t_ptr, n_elem,
+  cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w,
+  sip_topo_h, tl`) is unclear without grep.
+- What `kernel_args` takes as inputs and what tuple it must return is
+  a convention that is not documented anywhere.
+
+## Decision
+
+### D1. The algorithm module exposes four public symbols
+
+```python
+# src/kernbench/ccl/algorithms/<name>.py
+
+from __future__ import annotations
+
+# (required) topology-kind constants — referenced internally
+SIP_TOPO_RING = 0
+SIP_TOPO_TORUS = 1
+SIP_TOPO_MESH = 2
+
+# (optional) topology name → kind mapping. Used by the backend to
+# translate ccl.yaml/topology's string SIP topology into an integer.
+TOPO_NAME_TO_KIND = {
+    "ring_1d": SIP_TOPO_RING,
+    "torus_2d": SIP_TOPO_TORUS,
+    "mesh_2d_no_wrap": SIP_TOPO_MESH,
+}
+
+# (required) kernel argument builder
+def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
+    return (n_elem, cube_w, cube_h, world_size)
+
+# (required) kernel function (TLContext is injected via the `tl=...`
+# keyword argument).
+def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
+                         sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
+    ...
+
+# (required) kernel alias — the backend accesses `module.kernel`
+kernel = my_allreduce_kernel
+```
+
+- The `kernel` alias is the entry point the backend invokes. Whatever
+  the function name is (e.g., `allreduce_intercube_multidevice`), it
+  must be exposed via `module.kernel = fn`.
+- Without `kernel_args`, the backend has no way to build the
+  algorithm's argument list. See D2 for the signature.
+- If `TOPO_NAME_TO_KIND` is absent, the backend falls back to
+  `sip_topo_kind = 0`. An algorithm supporting only a single topology
+  may omit it.
+
+### D2. `kernel_args` signature — `(world_size, n_elem, *, cube_w, cube_h)`
+
+```python
+def kernel_args(world_size: int, n_elem: int, *,
+                cube_w: int = 4, cube_h: int = 4) -> tuple:
+    return (n_elem, cube_w, cube_h, world_size)
+```
+
+- **Positional arguments**: `world_size` (= number of ranks), `n_elem`
+  (= element count of a single shard, f16-based).
+- **Keyword arguments**: `cube_w`, `cube_h` (= cube-mesh dimensions).
+  Default 4×4 — aligned with `topology.yaml`'s `sip.cube_mesh` default.
+- **Return**: a tuple in the order the kernel's positional arguments
+  expect.
+
+When the backend calls `all_reduce`:
+
+```python
+kernel_args_tuple = self._algo_module.kernel_args(
+    self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
+)
+extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
+pending = self.ctx.launch(
+    self._merged["algorithm"], kernel_fn, tensor,
+    *kernel_args_tuple, *extra_args, _defer_wait=True,
+)
+```
+
+So the kernel's full positional argument list becomes: `(tensor_ptr,
+*kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w,
+sip_topo_h)`, with `tl=...` injected as a keyword. The tuple length
+and order returned by `kernel_args` must **match the kernel signature
+1:1**.
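+
+One way an author might sanity-check this 1:1 match before wiring a new
+module into `ccl.yaml` is sketched below. The `arity_matches` helper is
+hypothetical (it is not part of the backend); the kernel and
+`kernel_args` are the D1 examples with the body stubbed out.
+
+```python
+import inspect
+
+
+def kernel_args(world_size, n_elem, *, cube_w=4, cube_h=4):
+    return (n_elem, cube_w, cube_h, world_size)
+
+
+def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
+                        sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
+    return None  # body elided for the sketch
+
+
+N_EXTRA_ARGS = 4  # sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h
+
+
+def arity_matches(kernel, args_tuple):
+    """True if kernel takes tensor_ptr + args_tuple + the four sip_* args."""
+    positional_kinds = (
+        inspect.Parameter.POSITIONAL_ONLY,
+        inspect.Parameter.POSITIONAL_OR_KEYWORD,
+    )
+    n_positional = sum(
+        1
+        for p in inspect.signature(kernel).parameters.values()
+        if p.kind in positional_kinds and p.name != "tl"
+    )
+    return n_positional == 1 + len(args_tuple) + N_EXTRA_ARGS
+
+
+if __name__ == "__main__":
+    print(arity_matches(my_allreduce_kernel, kernel_args(2, 8)))
+```
+
+The check counts only arity; it cannot detect a tuple whose elements
+are in the wrong order.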
+
+### D3. Kernel signature — standardized 9 + tl arguments
+
+Recommended signature:
+
+```python
+def my_kernel(
+    t_ptr: int,             # VA base of the row-wise-sharded tensor on this SIP
+    n_elem: int,            # element count per cube tile (or per shard)
+    cube_w: int,            # cube mesh width  (from kernel_args)
+    cube_h: int,            # cube mesh height (from kernel_args)
+    n_sips: int,            # equal to world_size (rank = SIP, ADR-0024)
+    sip_rank: int,          # this SIP's rank
+    sip_topo_kind: int,     # result of TOPO_NAME_TO_KIND lookup
+    sip_topo_w: int,        # SIP mesh width (0 for ring_1d)
+    sip_topo_h: int,        # SIP mesh height (0 for ring_1d)
+    *, tl,                  # TLContext (auto-injected)
+) -> None:
+```
+
+Even if `kernel_args` chose a different positional argument order, the
+kernel's **last four positional arguments are always
+`(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`** — the backend
+appends them as `extra_args` (ADR-0047 D5). A custom algorithm must
+accept these four, but a single-SIP algorithm may simply ignore them.
+
+`tl` is injected via keyword — `RuntimeContext.launch` adds `tl=tl_ctx`
+just before invoking the kernel. The signature therefore exposes `tl`
+as keyword-only (`*, tl`) or as the trailing keyword parameter.
+
+### D4. Kernel body — freedom and constraints
+
+Available inside the kernel: every `tl.*` primitive from ADR-0046 D3.
+
+Common patterns:
+
+- `cube_id = tl.program_id(axis=1)` — this PE's cube index.
+- `pe_addr = t_ptr + cube_id * nbytes` — per-cube VA of the tile.
+- `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")` — load local
+  data.
+- `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)` — IPCQ
+  collective.
+- `acc = acc + recv` — TensorHandle arithmetic operators (ADR-0046 D4).
+- `tl.store(pe_addr, acc)` — store the result.
+
+The kernel body is plain Python — branching and loops are fine. But:
+
+- No SimPy `yield` or `async` (ADR-0046 D1).
+- No direct access to TensorHandle `.data` — the Phase 1 timing model
+  doesn't see data dependencies (ADR-0020's 2-pass separation).
+- Kernel execution must be deterministic — the same input must produce
+  the same op sequence. No random or external IO.
+
+### D5. SIP topology semantics — meaning of `sip_topo_kind`
+
+The backend looks up `topology.yaml`'s `system.sips.topology` string
+in the algorithm module's `TOPO_NAME_TO_KIND` and passes the integer
+as `sip_topo_kind`. The algorithm then branches:
+
+```python
+if sip_topo_kind == SIP_TOPO_RING:
+    acc = _inter_sip_ring(...)
+elif sip_topo_kind == SIP_TOPO_TORUS:
+    acc = _inter_sip_torus_2d(...)
+elif sip_topo_kind == SIP_TOPO_MESH:
+    acc = _inter_sip_mesh_2d(...)
+```
+
+Each topology branch communicates with peers via IPCQ direction names
+(`"global_E"`, `"W"`, `"S"`, `"N"` …). Direction semantics are defined
+in ADR-0023/0025; `configure_sfr_intercube_multisip` installs the IPCQ
+neighbor table accordingly.
+
+If a topology kind not supported by the algorithm appears, prefer an
+explicit `raise ValueError(f"unsupported topology kind
+{sip_topo_kind}")` over a silent no-op — fail fast on misconfiguration.
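+
+A minimal sketch of the fail-fast dispatch, using string stand-ins in
+place of the real `_inter_sip_*` phases (the `select_inter_sip_phase`
+helper is hypothetical):
+
+```python
+SIP_TOPO_RING = 0
+SIP_TOPO_TORUS = 1
+SIP_TOPO_MESH = 2
+
+# Stand-ins for the per-topology phases; a real module would map each
+# kind to its own inter-SIP routine.
+_PHASES = {
+    SIP_TOPO_RING: "ring",
+    SIP_TOPO_TORUS: "torus_2d",
+    SIP_TOPO_MESH: "mesh_2d",
+}
+
+
+def select_inter_sip_phase(sip_topo_kind: int) -> str:
+    """Return the phase name for a kind, or raise on an unknown kind."""
+    try:
+        return _PHASES[sip_topo_kind]
+    except KeyError:
+        raise ValueError(f"unsupported topology kind {sip_topo_kind}") from None
+
+
+if __name__ == "__main__":
+    print(select_inter_sip_phase(SIP_TOPO_TORUS))
+```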
+
+### D6. The `ccl.yaml` algorithm entry
+
+The algorithm module is paired with a `ccl.yaml` entry (ADR-0023 D10 +
+ADR-0047 D3):
+
+```yaml
+defaults:
+  algorithm: lrab_hierarchical_allreduce
+  n_elem: 8
+
+algorithms:
+  lrab_hierarchical_allreduce:
+    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
+    # optional: world_size override
+    # optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
+```
+
+- `module`: the full Python module path; `importlib.import_module`
+  consumes this string as-is.
+- `world_size` (optional): when set, overrides the topology fallback
+  (ADR-0047 D2).
+- Algorithm-specific parameters are consumed by
+  `configure_sfr_intercube_multisip`.
+
+Workflow to add a new algorithm:
+
+1. Write `src/kernbench/ccl/algorithms/<name>.py` following D1.
+2. Add the entry under `algorithms` in `ccl.yaml`.
+3. (If needed) extend `kernbench.ccl.sfr_config` with the SFR-install
+   branch.
+4. Add tests (e.g., `tests/sccl/test_<name>.py`, extending the
+   ADR-0043 eval harness).
+
+### D7. Legacy "rank = flat PE index" mode
+
+The `world_size` override in `ccl.yaml`, surfaced by ADR-0047 D2, is
+used by legacy "rank = flat PE index" tests. The algorithm module can
+assume `n_sips=world_size` ranks even in this mode — the backend
+maintains the rank↔(SIP, cube, PE) mapping, so no modal branching is
+needed inside the algorithm body.
+
+In single-cube workloads (where `cube_w=cube_h=1`), the algorithm must
+skip mesh-based phases — see the
+`single_cube = (cube_w == 1 and cube_h == 1)` pattern in
+`lrab_hierarchical_allreduce.py`.
+
+## Alternatives Considered
+
+### A1. Organize the algorithm module as a class (`class Allreduce: kernel(...)`)
+
+Rejected. The Python module namespace already identifies an algorithm
+(see ADR-0047 D3's `importlib.import_module`). A class wrapper adds
+indirection without simplifying dispatch. Module-level free functions
+plus a `kernel` alias are clean and obvious.
+
+### A2. Type `kernel_args` with an explicit dataclass
+
+Rejected (currently). Each algorithm normally has a different argument
+count; forcing one dataclass would hurt cross-algorithm interchange.
+The tuple return is simple and unpacks cleanly with the backend's
+`*kernel_args_tuple`. If an algorithm wants stronger internal typing,
+it may define its own NamedTuple.
+
+### A3. Move SFR installation inside the algorithm module
+
+Rejected. SFR installation
+(`configure_sfr_intercube_multisip`) is a cross-module decision
+combining topology + algorithm; `kernbench.ccl.sfr_config` is a more
+natural home than the algorithm module itself. D6's "extend
+sfr_config if needed" workflow keeps responsibility boundaries clear.
+
+### A4. Auto-register algorithm names via a decorator (analogous to ADR-0045's `@bench`)
+
+Rejected. Unlike benches, algorithms are already tied to `ccl.yaml`
+entries; an additional registry would be redundant. The string mapping
+in `module` is sufficient.
+
+## Consequences
+
+- ADR-0047 D3's one-line contract expands to a D1–D7 author-facing
+  guide; new algorithm signatures no longer need to be grep-derived.
+- D3's standardized 9 + tl signature couples naturally with the
+  backend's `extra_args` append (ADR-0047 D5). It is explicit that
+  even single-SIP-only algorithms must accept the four `sip_*` trailing
+  arguments.
+- D5's fail-loud recommendation means a `ccl.yaml` topology that the
+  algorithm doesn't support will surface as an explicit `ValueError`
+  rather than a silent wrong result.
+- D6's step-by-step addition workflow makes clear how far a new
+  algorithm has to reach into sfr_config / tests / ccl.yaml.
@@ -0,0 +1,288 @@
+# ADR-0051: Routing Helper API — `AddressResolver` + `PathRouter`
+
+## Status
+
+Accepted (2026-05-22).
+
+Pins down every public API, argument, return value, and adjacency-graph
+selection of the two helper classes (`AddressResolver`, `PathRouter`)
+exposed by `policy/routing/router.py`. ADR-0002 defines routing
+distance, ordering, and bypass rules, but **the helper API surface
+itself** has had no ADR-level coverage.
+
+## First action
+
+### `AddressResolver(graph)`
+
+On construction, caches two pieces of state:
+
+1. `self._node_ids = set(graph.nodes)` — a set of all node ids for
+   lookup.
+2. `self._hbm_slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube`
+   — derived from `graph.spec.cube.memory_map` (default `48 GB / 8
+   slices = 6 GB`). `resolve()` uses this value to decode `pe_id` from
+   an HBM PA's `hbm_offset`.
+
+In short, **AddressResolver's first act is "precompute the full set of
+node ids and the HBM slice size"**. It does not retain the graph
+itself.
+
+### `PathRouter(graph)`
+
+On construction, **builds four separate adjacency graphs in one pass**:
+
+1. `self._adj_all`: every edge (used for component-to-component
+   routing).
+2. `self._adj`: edges with `kind != "command"` (PE DMA / generic data
+   paths).
+3. `self._adj_mcpu_dma`: excludes
+   `_MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}` (M_CPU DMA
+   must not pass through PE pipeline nodes).
+4. `self._adj_local`: excludes the 8-element `_UCIE_KINDS` set (UCIe
+   would look like a zero-distance bus to Dijkstra, which would prefer
+   it over the mesh — for cube-local routing this must be avoided).
+
+Each graph is a `defaultdict(list)` of `(neighbor, weight)`. The
+weight is `edge.routing_weight_mm or edge.distance_mm`.
+
+In short, **PathRouter's first act is "classify topology edges into
+four policy-specific adjacency lists simultaneously"**. Each `find_*()`
+call picks the appropriate graph and runs Dijkstra.
+
+## Context
+
+`policy/routing/router.py` performs two responsibilities together:
+
+- **Naming**: it is the sole owner of the topology naming convention
+  (`sip{S}.cube{C}.<comp>`, `sip{S}.io{I}.pcie_ep`, etc.). Components /
+  probe / IPCQ install / runtime API do not build node-id strings
+  themselves — they call helpers.
+- **Path decisions**: policy separation by `edge.kind`. For the same
+  src→dst, different routing intents (PE DMA vs M_CPU DMA vs general
+  component routing) call for different adjacencies and so produce
+  different paths.
+
+This helper API is widely consumed (probe.py / distributed.py /
+install.py / various components / tests), yet **the exact signatures /
+return semantics / adjacency picks** are not gathered in any ADR. This
+ADR closes that gap.
+
+## Decision
+
+### D1. `AddressResolver` exposes five public methods
+
+#### D1.1. `resolve(addr: PhysAddr) -> str`
+
+Translates a `PhysAddr` to a destination node id in the topology:
+
+```
+addr.kind == "hbm"             → f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
+  where pe_id = addr.hbm_offset // self._hbm_slice_bytes  (ADR-0017 D4/D9)
+
+addr.kind == "pe_resource":
+  addr.unit_type == PE         → f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
+  addr.unit_type == SRAM       → f"sip{s}.cube{d}.sram"
+  addr.unit_type == MCPU       → f"sip{s}.cube{d}.m_cpu"
+  others                       → RoutingError("unsupported unit_type")
+
+other kinds                    → RoutingError("unsupported address kind")
+```
+
+If the derived node id is not in `self._node_ids`, `resolve()` raises
+`RoutingError(f"node {node_id} not found in topology")`. So even a
+syntactically valid address fails loudly when its node is absent from
+the topology.
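+
+The HBM branch can be sketched as follows, using the default
+`48 GB / 8 slices` figures from the First action section. The helper
+names are hypothetical and the sketch omits the `self._node_ids`
+membership check.
+
+```python
+def hbm_slice_bytes(hbm_total_gb: int = 48, slices_per_cube: int = 8) -> int:
+    """Size of one per-PE HBM slice in bytes."""
+    return hbm_total_gb * (1 << 30) // slices_per_cube
+
+
+def hbm_node_id(sip: int, cube: int, hbm_offset: int, slice_bytes: int) -> str:
+    """Map an HBM offset inside a cube to its hbm_ctrl node id."""
+    pe_id = hbm_offset // slice_bytes
+    return f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"
+
+
+if __name__ == "__main__":
+    slice_bytes = hbm_slice_bytes()
+    print(slice_bytes)
+    print(hbm_node_id(0, 3, slice_bytes + 4096, slice_bytes))
+```
+
+Offsets in `[0, slice_bytes)` decode to `pe0`, the next slice to `pe1`,
+and so on.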
+
+#### D1.2. `find_m_cpu(sip, cube) -> str`
+
+Returns `f"sip{sip}.cube{cube}.m_cpu"`; absent → `RoutingError`.
+
+#### D1.3. `find_pcie_ep(sip, io_id="io0") -> str`
+
+Returns `f"sip{sip}.{io_id}.pcie_ep"`; absent → `RoutingError`.
+
+#### D1.4. `find_io_cpu(sip, io_id="io0") -> str`
+
+Returns `f"sip{sip}.{io_id}.io_cpu"`; absent → `RoutingError`.
+
+#### D1.5. `find_all_pcie_eps() -> list[str]`
+
+All PCIE_EP node ids across all SIPs, sorted. Filtered by
+`endswith(".pcie_ep")`. Cross-SIP IPCQ uses this when enumerating
+PCIE_EPs.
+
+This class is the sole owner of the naming convention
+(`sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`) — ADR-0015 D4.
+The topology builder produces nodes with the same naming convention;
+components never build node-id strings directly — they go through
+these helpers.
+
+### D2. `PathRouter`'s four adjacency graphs
+
+Constructed in one pass. `edge.kind` drives policy:
+
+| graph             | excluded edge kinds                                                                                                                   | use case                                       |
+|-------------------|--------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
+| `_adj_all`        | (none)                                                                                                                               | M_CPU↔NOC command included, IO_CPU/M_CPU routes |
+| `_adj`            | `"command"`                                                                                                                          | PE DMA / generic data paths                     |
+| `_adj_mcpu_dma`   | `"pe_internal"`, `"pe_to_router"`                                                                                                    | M_CPU DMA (skips PE pipeline)                   |
+| `_adj_local`      | `_UCIE_KINDS` (`ucie_internal`, `ucie_conn_to_router`, `router_to_ucie_conn`, `ucie_conn_to_noc`, `noc_to_ucie_conn`, `ucie_mesh`, `io_to_cube`, `cube_to_io`) | same-cube routing (UCIe bus excluded) |
+
+Each graph is `dict[node_id, list[(neighbor, weight)]]` with weight =
+`edge.routing_weight_mm or edge.distance_mm`. Excluding command edges
+prevents them from influencing routing; isolating `_adj_local` keeps
+UCIe's "zero-distance bus" from out-competing the mesh — consistent
+with ADR-0017 D7's cross-PE-slice mesh-distance requirement.
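+
+A one-pass construction consistent with the table might look like the
+sketch below. The `Edge` dataclass and `build_adjacency` function are
+simplified stand-ins, not the real topology types, and edges are
+assumed to be directed.
+
+```python
+from collections import defaultdict
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass
+class Edge:
+    src: str
+    dst: str
+    kind: str
+    distance_mm: float
+    routing_weight_mm: Optional[float] = None
+
+
+MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}
+UCIE_KINDS = {
+    "ucie_internal", "ucie_conn_to_router", "router_to_ucie_conn",
+    "ucie_conn_to_noc", "noc_to_ucie_conn", "ucie_mesh",
+    "io_to_cube", "cube_to_io",
+}
+
+
+def build_adjacency(edges):
+    """Classify every edge into the four policy graphs in a single loop."""
+    adj_all, adj, adj_mcpu_dma, adj_local = (defaultdict(list) for _ in range(4))
+    for e in edges:
+        # A missing (or zero) routing weight falls back to physical distance.
+        weight = e.routing_weight_mm or e.distance_mm
+        entry = (e.dst, weight)
+        adj_all[e.src].append(entry)
+        if e.kind != "command":
+            adj[e.src].append(entry)
+        if e.kind not in MCPU_DMA_EXCLUDE:
+            adj_mcpu_dma[e.src].append(entry)
+        if e.kind not in UCIE_KINDS:
+            adj_local[e.src].append(entry)
+    return adj_all, adj, adj_mcpu_dma, adj_local
+```
+
+Note that `a or b` treats an explicit weight of `0.0` as absent, which
+is a property of the expression quoted above rather than of this sketch.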
+
+### D3. `PathRouter` exposes five public methods (+ two backward-compat shims)
+
+#### D3.1. `find_path(src_pe: str, dst_node: str) -> list[str]`
+
+**PE DMA routing**. `src_pe` is a PE prefix (e.g.,
+`"sip0.cube0.pe0"`); the function auto-prepends `.pe_dma`, making the
+true start node `"sip0.cube0.pe0.pe_dma"`.
+
+Adjacency depends on cube-locality (`_same_cube`):
+
+- **Same-cube** (src and dst share `sip{S}.cube{C}.` prefix): uses
+  `_adj_local`. Excluding UCIe lets cross-PE-slice access pay accurate
+  mesh distance (ADR-0017 D7).
+- **Cross-cube**: uses `_adj`. UCIe naturally becomes the right choice
+  for the cross-cube portion.
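+
+The cube-locality test can be sketched with a prefix match on the
+`sip{S}.cube{C}.` naming convention. This is an assumption about how
+`_same_cube` behaves, not its actual implementation; `same_cube` below
+is a hypothetical stand-in.
+
+```python
+import re
+
+# Captures "sip<S>.cube<C>" and requires the trailing dot, so that
+# "cube1" never matches as a prefix of "cube10".
+_CUBE_PREFIX = re.compile(r"^(sip\d+\.cube\d+)\.")
+
+
+def same_cube(node_a: str, node_b: str) -> bool:
+    """True if both node ids carry the same sip/cube prefix."""
+    match_a = _CUBE_PREFIX.match(node_a)
+    match_b = _CUBE_PREFIX.match(node_b)
+    if match_a is None or match_b is None:
+        return False  # e.g. IO-chiplet nodes such as "sip0.io0.pcie_ep"
+    return match_a.group(1) == match_b.group(1)
+
+
+if __name__ == "__main__":
+    print(same_cube("sip0.cube0.pe0.pe_dma", "sip0.cube0.hbm_ctrl.pe3"))
+```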
+
+#### D3.2. `find_path_with_distance(src_pe, dst_node) -> tuple[list[str], float]`
+
+Same adjacency policy as D3.1, but returns `(path, total_distance)`.
+Used by probe and analysis tools that need the distance metric.
+
+#### D3.3. `find_mcpu_dma_path(m_cpu_id: str, dst_hbm_id: str) -> list[str]`
+
+**M_CPU DMA path**. Same cube → `_adj_local` (stay within the mesh);
+different cube → `_adj_all` (cross via UCIe). The
+`_MCPU_DMA_EXCLUDE` set ensures PE-pipeline nodes never appear on
+M_CPU's routes.
+
+#### D3.4. `find_memory_path(src: str, dst: str) -> list[str]`
+
+Direct memory path like
+`pcie_ep → io_noc → cube → router mesh → hbm_ctrl`. Uses
+`_adj_mcpu_dma` to exclude `pe_internal` and `pe_to_router`, so
+host-issued reads/writes never leak into the PE pipeline. Probe
+(ADR-0049 D1's H2D/D2H cases) calls this directly.
+
+#### D3.5. `find_node_path(src: str, dst: str) -> list[str]`
+
+Generic routing between arbitrary nodes, **including command edges**
+(via `_adj_all`). IoCpuComponent / MCpuComponent use this when they
+need to route through M_CPU ↔ NOC command-kind links.
+
+#### D3.6. Backward-compat shims
+
+- `_dijkstra(start, goal) -> list[str]` — thin wrapper for
+  `_run_dijkstra(self._adj, …)`.
+- `_dijkstra_with_dist(start, goal) -> tuple[list[str], float]` —
+  distance-aware variant.
+
+Despite the underscore prefixes (suggesting internal API), existing
+tests call these directly. New code should prefer D3.1–D3.5; these two
+shims are deprecation candidates.
+
+### D4. Dijkstra — single-source shortest path
+
+`_run_dijkstra_with_dist(adj, start, goal)`:
+
+- `heapq` priority queue.
+- `best: dict[node, distance]` — best known distance to each node.
+- `prev: dict[node, predecessor]` — for path reconstruction.
+- Edge weight = `routing_weight_mm or distance_mm`. The separation
+  matters because UCIe (and a few others) declare an explicit
+  `routing_weight_mm` distinct from physical `distance_mm`.
+
+`start == goal` short-circuits to `([start], 0.0)`. Unreachable target
+→ `RoutingError(f"no path from {start} to {goal}")`.
+
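+The algorithm is **deterministic**: identical graph + start/goal gives
+the same path, satisfying SPEC R1 ("routing MUST be deterministic").
+Tie-breaks depend only on the heap entries and on the order in which
+neighbors are pushed, both of which are fixed for a given graph. Note
+that `heapq` is not a stable queue: entries with equal distance are
+ordered by comparing the rest of the entry, not by push order.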
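+
+A self-contained sketch of the search described in this section,
+assuming `adj` maps a node id to a list of `(neighbor, weight)` pairs
+as in D2. The local `RoutingError` class and the
+`run_dijkstra_with_dist` function are stand-ins for the real ones in
+`router.py`.
+
+```python
+import heapq
+
+
+class RoutingError(Exception):
+    """Stand-in for the router's error type."""
+
+
+def run_dijkstra_with_dist(adj, start, goal):
+    """Return (path, total_distance) for the shortest start -> goal route."""
+    if start == goal:
+        return [start], 0.0
+    best = {start: 0.0}   # best known distance per node
+    prev = {}             # predecessor per node, for path reconstruction
+    heap = [(0.0, start)]
+    while heap:
+        dist, node = heapq.heappop(heap)
+        if node == goal:
+            break
+        if dist > best.get(node, float("inf")):
+            continue  # stale entry superseded by a shorter route
+        for neighbor, weight in adj.get(node, ()):
+            candidate = dist + weight
+            if candidate < best.get(neighbor, float("inf")):
+                best[neighbor] = candidate
+                prev[neighbor] = node
+                heapq.heappush(heap, (candidate, neighbor))
+    if goal not in best:
+        raise RoutingError(f"no path from {start} to {goal}")
+    path = [goal]
+    while path[-1] != start:
+        path.append(prev[path[-1]])
+    path.reverse()
+    return path, best[goal]
+```
+
+In this sketch the heap entries are `(distance, node_id)` tuples, so
+equal-distance entries are popped in node-id string order.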
+
+### D5. Single-owner principle for helper-API decisions
+
+The following decisions live only inside router.py:
+
+- Naming convention: `sip{S}.cube{C}.<comp>`,
+  `sip{S}.{io_id}.<comp>`,
+  `sip{S}.cube{C}.hbm_ctrl.pe{pe_id}`.
+- Adjacency policy: which edge kinds belong to which graph.
+- Algorithm for recovering the PE id from an HBM offset and the HBM
+  slice size.
+- Dijkstra weight selection
+  (`routing_weight_mm or distance_mm`).
+
+Breaking single ownership (e.g., a component starting to build
+`f"sip{s}..."` itself) would explode the blast radius of naming-
+convention changes. This aligns with ADR-0015 D4.
+
+### D6. Consumers of the helper API
+
+Methods listed in this ADR are called from (current corpus):
+
+- `probes/probe.py` (ADR-0049): `find_pcie_ep`, `find_io_cpu`,
+  `find_m_cpu`, `find_node_path`, `find_mcpu_dma_path`,
+  `find_memory_path`, `find_path`, `resolve`.
+- `runtime_api/distributed.py` (ADR-0047): indirectly (engine-internal
+  routing).
+- `ccl/install.py` (ADR-0023): `find_all_pcie_eps`, `resolve`.
+- `sim_engine/event_log.py`: like probe — `find_pcie_ep`,
+  `find_memory_path`.
+- `components/builtin/m_cpu.py`, `components/builtin/io_cpu.py`:
+  `find_node_path`, `find_mcpu_dma_path`.
+- Tests (test_routing.py, test_cross_sip_routing.py, …): most of
+  D3.1–D3.5.
+
+When a new consumer arrives, D1/D3 act as a first-pass guide on
+whether an existing method matches the intent or a new one is needed.
+
+## Alternatives Considered
+
+### A1. One adjacency graph + per-call edge-kind filtering
+
+Rejected. Re-filtering the graph on every `find_*()` call hurts
+Dijkstra cache locality. Constructing four graphs in one pass (D2)
+has modest memory cost (edges ≤ a few × 10⁴), and selection happens
+in O(1) at call time.
+
+### A2. Drive adjacency separation by separate edge metadata rather than `kind`
+
+Rejected. `edge.kind` is already assigned by the topology builder
+(ADR-0015 D4 + ADR-0017); a parallel metadata field would force
+synchronization between two systems.
+
+### A3. Use BFS with uniform weights instead of Dijkstra
+
+Rejected. With per-edge `routing_weight_mm` (mesh link / UCIe /
+IO-internal), BFS minimizes hop count rather than total
+latency/distance. SPEC R1 + R2 require deterministic and accurate
+routing, which BFS does not deliver.
+
+### A4. Express the helper API as module functions instead of classes
+
+Rejected. Each class
+(`AddressResolver`, `PathRouter`) maintains caches
+(`_node_ids`, `_hbm_slice_bytes`, four adjacency graphs) reused across
+many routing queries on the same graph. Module functions would have
+to rebuild state per call or go global, hurting safety and
+performance.
+
+## Consequences
+
+- When components / probe / IPCQ install / runtime API all go through
+  router.py helpers, a naming-convention change (e.g., `.io0.` →
+  `.iochiplet0.`) is a one-file edit (D5).
+- D2's four-graph split is now ADR-locked, so when a new edge kind is
+  added (e.g., a new inter-die UCIe-link kind), the right adjacency
+  category is decided explicitly rather than by default.
+- D3.1's same-cube vs cross-cube branching (ADR-0017 D7) is explicit,
+  so anyone changing routing knows which adjacency to touch.
+- D6's consumer list bounds PR-review scope for helper-API changes,
+  and the backward-compat shims (D3.6) are flagged as deprecation
+  candidates.
@@ -0,0 +1,371 @@
+# ADR-0052: OpLog + MemoryStore Schemas — sim_engine internals
+
+## Status
+
+Accepted (2026-05-22).
+
+Pins down the `OpRecord` schema and the `record_start` / `record_end` /
+`record_copy` behavior in `sim_engine/op_log.py`, plus the
+(space, addr) namespace and read/write semantics of `MemoryStore` in
+`sim_engine/memory_store.py`. ADR-0020 (2-pass data execution) declares
+that these two facilities exist, but **the precise record fields and
+semantics** had no ADR-level coverage, and several recent ADRs
+(ADR-0046 D3.2's `tl.store` visibility, ADR-0023 D9's IPCQ copy
+record) depend on these semantics.
+
+## First action
+
+### `OpLogger(memory_store=None)`
+
+On construction, initialize three fields:
+
+1. `self._records: list[OpRecord] = []` — accumulated records.
+2. `self._pending: dict[int, dict] = {}` — partial records keyed by
+   `id(msg)` (created at `record_start`, completed at `record_end`).
+3. `self._memory_store = memory_store` — optional MemoryStore
+   reference. Used to capture math-op input snapshots and dma_write
+   HBM-source snapshots.
+
+Records and pending are empty; the `record_*` calls accumulate data
+over time.
+
+### `MemoryStore()`
+
+On construction, initialize a single field:
+`self._storage: dict[str, dict[int, np.ndarray]] = {}` — a two-level
+dict (`space → addr → ndarray`). Inner dicts are created lazily as new
+spaces appear.
+
+In short, **each facility's first act is to set up empty state**:
+`OpLogger` an empty record list plus pending map, `MemoryStore` an
+empty, sparse per-space dict. The first record / write fills them
+when it arrives.
+
+## Context
+
+ADR-0020 D2/D5/D7 (2-pass data execution) declares:
+
+- During Phase 1 (timing), `ComponentBase._on_process_start/end` hooks
+  call `OpLogger.record_start/end`, recording the time and metadata of
+  every data op.
+- Phase 2 (data) replays the op log in `t_start` order to compute real
+  data.
+- Data payloads live in `MemoryStore`, keyed by (space, addr).
+
+Subsequent ADRs (ADR-0023 D9's IPCQ atomic write, ADR-0027's Megatron
+TP scratch-overwrite avoidance, ADR-0046 D3.2's `tl.store` visibility)
+depend on op_log and MemoryStore behavior, but **the exact record
+fields / space names / snapshot timing** are only discoverable via
+source grep. This ADR codifies them.
+
+## Decision
+
+### D1. `OpRecord` schema — seven fields
+
+```python
+@dataclass
+class OpRecord:
+    t_start: float
+    t_end: float
+    component_id: str
+    op_kind: str               # "memory" | "gemm" | "math" | "unknown"
+    op_name: str               # e.g. "dma_read", "gemm_f16", "exp",
+                               #     "TileToken/DMA_READ", "composite_gemm",
+                               #     "ipcq_copy"
+    params: dict[str, Any]
+    dependency_ids: list[int] = field(default_factory=list)
+```
+
+- **`t_start` / `t_end`**: SimPy time (float ns). `t_start` is when the
+  component begins the op; `t_end` is completion. Duration =
+  `t_end - t_start`.
+- **`component_id`**: the node id where the op occurred (e.g.,
+  `"sip0.cube0.pe0.pe_dma"`).
+- **`op_kind`**: one of four. Phase 2 DataExecutor branches on this.
+- **`op_name`**: a debug/analysis-friendly name. For a TileToken,
+  expands to `"TileToken/{stage_type}"` (e.g.,
+  `"TileToken/DMA_READ"`) to disambiguate stages.
+- **`params`**: op-specific metadata dict (see D3).
+- **`dependency_ids`**: currently unused (default `[]`). Reserved for
+  future cross-op dependency tracking.
+
+### D2. `OpLogger.records` — guaranteed `t_start` sort
+
+```python
+@property
+def records(self) -> list[OpRecord]:
+    self._records.sort(key=lambda r: r.t_start)
+    return self._records
+```
+
+A stable sort by `t_start` runs on each access. Records with the same
+`t_start` preserve insertion order. Aligns with ADR-0020 D5's
+"t_start stable ordering" requirement.
+
+Phase 2 DataExecutor always accesses via the `records` property, so
+even when `record_end` calls arrive out of `t_start` order (e.g., a
+short op started later but finished earlier), the sequence handed to
+Phase 2 is consistent.
+
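The ordering guarantee can be shown with a small sketch. The `OpRecord` below mirrors D1 but gives `params` a default for brevity, and the component ids and times are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str
    op_name: str
    params: dict[str, Any] = field(default_factory=dict)  # default added here
    dependency_ids: list[int] = field(default_factory=list)


# Appended in record_end order: the gemm started first but finished last.
records = [
    OpRecord(5.0, 6.0, "pe0.pe_math", "math", "exp"),
    OpRecord(5.0, 7.0, "pe0.pe_dma", "memory", "dma_read"),
    OpRecord(0.0, 9.0, "pe0.pe_gemm", "gemm", "gemm_f16"),
]

# list.sort is stable: the two records at t_start == 5.0 keep insertion order.
records.sort(key=lambda r: r.t_start)

print([r.op_name for r in records])  # ['gemm_f16', 'exp', 'dma_read']
```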
+### D3. `params` schema per `op_name` (matrix from `_extract_op_info`)
+
+#### D3.1. `op_kind="memory", op_name="dma_read"` (DmaReadCmd)
+
+```python
+{"src_addr": int, "nbytes": int, "handle_id": str}
+```
+
+#### D3.2. `op_kind="memory", op_name="dma_write"` (DmaWriteCmd)
+
+```python
+{
+    "src_space": str,   # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
+    "src_addr": int,    # handle.addr
+    "shape": tuple, "dtype": str,
+    "dst_space": "hbm", # DmaWrite always targets HBM
+    "dst_addr": int,
+    "nbytes": int,
+    "handle_id": str,
+    # When src_space == "hbm" at record_end, a snapshot is added (D4)
+    "snapshot": np.ndarray | None,
+}
+```
+
+#### D3.3. `op_kind="gemm", op_name=f"gemm_{dtype_a}"` (GemmCmd)
+
+```python
+{
+    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
+    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
+    "dtype_in": str, "dtype_out": str,
+    "m": int, "k": int, "n": int,
+    # ADR-0027: per-operand + output spaces preserved
+    "src_a_space": str, "src_b_space": str, "dst_space": str,
+}
+```
+
+#### D3.4. `op_kind="math", op_name=msg.op` (MathCmd; op = "exp", "sum", "add", "where", …)
+
+```python
+{
+    "input_addrs": list[int],   # addrs of input handles
+    "input_shapes": list[tuple],
+    "input_spaces": list[str],
+    "input_dtypes": list[str],
+    "dst_addr": int, "dst_space": str,
+    "shape_out": tuple, "dtype": str,
+    "axis": int | None,         # only meaningful for reductions
+    # All inputs get snapshots at record_end (D4)
+    "input_snapshots": list[np.ndarray | None],
+}
+```
+
+#### D3.5. `op_kind="gemm" or "math", op_name=f"composite_{op}"` (CompositeCmd)
+
+```python
+{
+    "op": str,              # "gemm" | "math"
+    "out_addr": int, "out_nbytes": int,
+    # If op == "gemm", same fields as GemmCmd are added:
+    "src_a_addr": int, "src_b_addr": int,
+    "shape_a": tuple, "shape_b": tuple,
+    "dtype_in": str, "dtype_out": str,
+    "src_a_space": str, "src_b_space": str,
+    "dst_space": "hbm", "dst_addr": int,  # = out_addr
+}
+```
+
+If `op == "gemm"`, `op_kind = "gemm"`; otherwise `"math"`. This aliasing
+lets Phase 2 replay composite-gemm on the same path as `GemmCmd`.
+
+#### D3.6. `op_kind="memory", op_name="ipcq_copy"` (record_copy path)
+
+```python
+{
+    "src_space": str, "src_addr": int,
+    "dst_space": str, "dst_addr": int,
+    "shape": tuple, "dtype": str, "nbytes": int,
+    "snapshot": np.ndarray | None,   # passed by caller; if None, record_copy reads fresh
+}
+```
+
+`PE_DMA._handle_ipcq_inbound` (ADR-0023 D9) emits this record so Phase
+2 can replay the IPCQ slot's inbound copy. It bypasses
+`record_start` / `record_end` and pushes directly via `record_copy()`.
+
+#### D3.7. `op_kind="unknown", op_name=type(msg).__name__`
+
+Fallback for messages `_extract_op_info` doesn't recognize. `params =
+{}`. If DataExecutor encounters this kind, it skips the record, so
+Phase 2 replay is unaffected.
+
+### D4. Snapshot capture timing
+
+When `OpLogger._memory_store` is set, `record_end` performs:
+
+- **Math op**: read every input
+  (addr/shape/space/dtype) from `self._memory_store.read(...)` and
+  attach an ndarray copy to `params["input_snapshots"]`. Read failure
+  → `None`.
+- **`dma_write` op**: snapshot the source **only if `src_space ==
+  "hbm"`** and attach to `params["snapshot"]`. TCM (PE scratch)
+  sources are **deliberately skipped** — TCM is repopulated by Phase 2
+  math/gemm replay, and a Phase-1-time snapshot would capture a
+  previous kernel's stale value (ADR-0027 postmortem: TP gemm →
+  all_reduce race).
+- **`ipcq_copy`**: the caller passes the in-flight snapshot via
+  `snapshot=token.data`. If absent, `record_copy` attempts a fresh
+  read from MemoryStore.
+
+Snapshots are taken with `.copy()` (fresh allocation), making them
+safe against later storage mutation. This is the foundation of
+ADR-0027's "cross-PE Phase 2 ordering" race-avoidance.
+
+When `memory_store` is `None` (Phase 1 timing-only mode), all
+snapshot steps are skipped. Only the timing portion of the record is
+preserved; data replay is unavailable.
+
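The policy above can be summarized in a short sketch. It is an illustration under stated assumptions, not the `record_end` code: `capture_snapshots` is a hypothetical helper, `store` is a plain `{(space, addr): payload}` dict standing in for `MemoryStore`, and lists stand in for ndarrays (both expose `.copy()`).

```python
def capture_snapshots(op_kind, op_name, params, store):
    """Attach Phase-1 snapshots to params following the D4 policy (sketch)."""
    if store is None:  # timing-only mode: every snapshot step is skipped
        return params
    if op_kind == "math":
        snaps = []
        for space, addr in zip(params["input_spaces"], params["input_addrs"]):
            data = store.get((space, addr))
            snaps.append(data.copy() if data is not None else None)
        params["input_snapshots"] = snaps
    elif op_name == "dma_write" and params.get("src_space", "tcm") == "hbm":
        # Only HBM sources are snapshotted; TCM sources are skipped on purpose.
        data = store.get((params["src_space"], params["src_addr"]))
        params["snapshot"] = data.copy() if data is not None else None
    return params


store = {("hbm", 0x100): [1, 2, 3], ("tcm", 0x0): [9, 9]}
hbm_write = capture_snapshots(
    "memory", "dma_write", {"src_space": "hbm", "src_addr": 0x100}, store)
tcm_write = capture_snapshots(
    "memory", "dma_write", {"src_space": "tcm", "src_addr": 0x0}, store)

store[("hbm", 0x100)][0] = 42  # a later mutation does not reach the snapshot
print(hbm_write["snapshot"], "snapshot" in tcm_write)  # [1, 2, 3] False
```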
+### D5. TileToken handling — `record_start` captures stage info
+
+ADR-0014 D6's self-routing tile token (pipeline mode) may have already
+advanced its `stage_idx` by the time `record_end` runs (the TileToken
+caches the next stage's params as it moves to the next component).
+Therefore:
+
+`record_start` pre-saves the following in `self._pending[id(msg)]["snap"]`:
+
+```python
+snap["stage_type"] = stage.stage_type.name        # "DMA_READ", "GEMM", ...
+snap["stage_params"] = dict(stage.params)         # copy of params at start time
+```
+
+`record_end` retrieves this snap and merges into params:
+
+- Adds `params["stage_type"]` to final params.
+- Merges `stage_params` keys (keeps existing values if any).
+- If `op_name == "TileToken"`, rewrites it to
+  `f"TileToken/{stage_type}"` (e.g., `"TileToken/DMA_READ"`),
+  disambiguating different stages emitted by the same component.
+
+Thanks to this, DMA_READ vs DMA_WRITE, FETCH vs STORE coming from the
+same component (e.g., pe_dma) are distinguishable in reports.
+
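A minimal sketch of why the stage is captured at `record_start`. `Stage`, `TileToken`, and `MiniLogger` are toy stand-ins invented for this example, and `stage_type` is a plain string here rather than the enum whose `.name` the real code reads.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    stage_type: str
    params: dict = field(default_factory=dict)


@dataclass
class TileToken:
    stages: list
    stage_idx: int = 0


class MiniLogger:
    def __init__(self):
        self.records = []
        self._pending = {}

    def record_start(self, msg, t):
        stage = msg.stages[msg.stage_idx]
        snap = {"stage_type": stage.stage_type,
                "stage_params": dict(stage.params)}  # copy at start time
        self._pending[id(msg)] = {"t_start": t, "snap": snap}

    def record_end(self, msg, t):
        entry = self._pending.pop(id(msg))
        snap = entry["snap"]
        params = {"stage_type": snap["stage_type"]}
        for key, value in snap["stage_params"].items():
            params.setdefault(key, value)  # keep existing values if any
        op_name = f"{type(msg).__name__}/{snap['stage_type']}"
        self.records.append((entry["t_start"], t, op_name, params))


token = TileToken([Stage("DMA_READ", {"nbytes": 256}), Stage("GEMM")])
log = MiniLogger()
log.record_start(token, 0.0)
token.stage_idx += 1  # the token advances before record_end runs
log.record_end(token, 4.0)
print(log.records[0][2])  # TileToken/DMA_READ
```

Had `record_end` read `token.stages[token.stage_idx]` directly, the record would have been labeled `TileToken/GEMM`.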
+### D6. `MemoryStore` — two-level (space, addr) dict
+
+```python
+class MemoryStore:
+    def __init__(self) -> None:
+        self._storage: dict[str, dict[int, np.ndarray]] = {}
+
+    def write(self, space, addr, data):
+        # Create the inner dict lazily the first time a space is written.
+        self._storage.setdefault(space, {})[addr] = data
+
+    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
+    def has(self, space, addr) -> bool: ...
+    def snapshot(self) -> "MemoryStore": ...
+```
+
+#### D6.1. Space namespace
+
+A string key. Standard values:
+
+- `"hbm"`: HBM data (deploy_tensor + Phase 2 dma_write results).
+- `"tcm"`: PE-local TCM (Phase 2 math/gemm output).
+- `"sram"`: cube-level SRAM (ADR-0023 D9.7's IPCQ slot tier).
+
+Other spaces (e.g., `"reg"`) are allowed — `_storage` is a lazy dict
+that creates a new space when `write` first touches it.
+
+#### D6.2. Address keying
+
+`addr` is an integer. It may be a **physical address (PA) or a virtual
+address (VA)** — `MemoryStore` itself doesn't know address-space
+semantics; it just uses them as keys. Phase 1's `MemoryWriteMsg`
+writes both PA and VA
+(`_create_tensor` zero-inits at PA and at the VA base too); Phase 2
+reads/writes via the addresses captured by op_log.
+
+The caller decides `addr`'s meaning — `MemoryStore` provides only
+lookup.
+
+#### D6.3. read/write semantics — reference store (no copy)
+
+`write(space, addr, data)`: stores the ndarray reference. **No copy.**
+If the caller later mutates the same ndarray, the stored value
+changes.
+
+`read(space, addr, shape=None, dtype=None)`: returns the stored
+ndarray reference. If `shape`/`dtype` are provided:
+
+- `dtype != stored.dtype`: `arr.view(np_dtype)` reinterprets as a
+  view (no copy).
+- `shape != stored.shape`: if `nbytes` matches, `arr.reshape(shape)`
+  is a view (for a contiguous array; NumPy copies otherwise).
+- `nbytes` mismatch → `ValueError`.
+
+To detach the data, the caller must call `arr.copy()`. ADR-0027's
+race-avoidance requires explicit `.copy()` in op_log snapshot steps
+for exactly this reason.
+
+#### D6.4. `has(space, addr) -> bool`
+
+Existence check; does not materialize data.
+
+#### D6.5. `snapshot() -> MemoryStore`
+
+Shallow copy. Creates new inner dicts but shares ndarray references.
+Used at Phase 2 init to fork from Phase 1's store, so Phase 2 writes
+(which rebind a key) don't affect Phase 1's remaining consumers.
+In-place mutation of a shared ndarray is still visible to both stores.
+
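The shallow-copy behavior can be sketched with a toy store. `MiniStore` is a hypothetical stand-in, and lists replace ndarrays so the example needs only the standard library. The point is the difference between rebinding a key and mutating a shared payload.

```python
class MiniStore:
    """Toy (space, addr) store; lists stand in for ndarrays."""

    def __init__(self):
        self._storage = {}

    def write(self, space, addr, data):
        self._storage.setdefault(space, {})[addr] = data  # reference, no copy

    def read(self, space, addr):
        return self._storage[space][addr]

    def snapshot(self):
        fork = MiniStore()
        # New inner dicts, same payload objects.
        fork._storage = {s: dict(inner) for s, inner in self._storage.items()}
        return fork


phase1 = MiniStore()
phase1.write("hbm", 0x0, [1, 2])
phase1.write("hbm", 0x8, [3])
phase2 = phase1.snapshot()

phase2.write("hbm", 0x0, [7, 7])   # rebinding a key: isolated from phase1
phase2.read("hbm", 0x8)[0] = 99    # in-place mutation: shared with phase1

print(phase1.read("hbm", 0x0), phase1.read("hbm", 0x8))  # [1, 2] [99]
```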
+### D7. op_log assumes a single-threaded SimPy
+
+`OpLogger`'s `_records` and `_pending` are lock-free. SimPy is
+single-threaded, so nothing else can intrude between `record_start`
+and `record_end` for the same message.
+
+When multi-process kernbench (ADR-0047 D6) arrives, OpLogger must be
+split per process — one OpLogger instance cannot receive records from
+multiple processes.
+
+## Alternatives Considered
+
+### A1. Externalize op_log to SQLite / parquet
+
+Rejected (currently). The in-memory list minimizes Phase 1 → Phase 2
+hand-off latency. Externalization makes sense for long-running batch
+runs but adds overhead for the current single-run workload.
+
+### A2. Capture snapshots at `record_start`
+
+Rejected. At `record_start`, inputs are often not yet populated (e.g.,
+a math op's input is the output of a just-issued previous op).
+`record_end` is the correct point.
+
+### A3. Per-component MemoryStore
+
+Rejected. The (space, addr) key already disambiguates effectively, and
+splitting per component would complicate cross-PE IPCQ copy (ADR-0023
+D9), which needs access to both source and destination stores.
+
+### A4. Explicit dependency edges in op_log
+
+Partially adopted. The `dependency_ids` field exists on `OpRecord` but
+is currently unused (D1). Phase 2 DataExecutor orders via `t_start` +
+a secondary sort (memory ops before math at the same `t_start`). When
+an explicit dependency graph is required, this field is the home.
+Current ordering rules are sufficient, so it remains unused.
+
+## Consequences
+
+- ADR-0020's op_log / MemoryStore declarations are expanded into the
+  concrete D1–D6 schemas, so writing/modifying Phase 2 DataExecutor
+  doesn't need source-grep to learn field semantics.
+- D3's per-`op_name` params matrix makes adding new ops (e.g., a new
+  reduction type) a question of branching in `_extract_op_info`.
+- D4's per-op snapshot policy (math = input snapshot, dma_write =
+  HBM-only snapshot) is ADR-locked, so ADR-0027's race-avoidance
+  decision won't silently regress on future refactors.
+- D6.3's reference-store semantics are explicit, putting mutation
+  safety on the caller. ADR-0027's explicit `.copy()` pattern is
+  justified.
+- D7's single-thread assumption is recorded, so multi-process
+  kernbench (ADR-0047 D6's supersession candidate) will need OpLogger
+  separation when introduced.
@@ -0,0 +1,351 @@
+# ADR-0053: Topology Builder + Visualizer Algorithms
+
+## Status
+
+Accepted (2026-05-22).
+
+Pins down the key algorithmic choices of the topology compile and
+visualization pipeline jointly implemented by `topology/builder.py`,
+`topology/mesh_gen.py`, and `topology/visualizer.py` —
+placement-driven router attachment, mesh auto-layout, the source_hash
+cache, view projections, and SVG rendering. ADR-0006 defines the
+high-level intent of topology compilation (compiled topology, distance
+extraction, automatic diagram generation), but **which algorithms the
+builder actually uses** was only discoverable via source grep.
+
+## First action
+
+When `resolve_topology(path_str)` is called, four steps run in order:
+
+1. **Path validation** (`builder.py::resolve_topology`):
+   `Path(path_str).expanduser().resolve()`, existence check, file
+   check. Failure → `FileNotFoundError` or `ValueError`.
+2. **YAML parsing** (`_read_spec`): `yaml.safe_load`. Parse errors
+   yield a `ValueError` with line/column. Non-dict roots are
+   rejected.
+3. **Auto-generate the mesh** (`mesh_gen.ensure_mesh_file`): create or
+   reuse a `cube_mesh.yaml` next to the topology file. Cache hit on
+   matching source_hash; miss triggers regeneration. This step decides
+   the cube NoC's router grid and attachment information.
+4. **Compile the graph** (`_compile_graph`): system → IO chiplets →
+   cubes → inter-cube edges → IO↔cube edges → system↔IO edges, then
+   build four view projections (system, sip, cube, pe) and wrap into
+   a `TopologyGraph`.
+
+In short, **topology compilation's first act is "read topology.yaml as
+a dict, create/validate cube_mesh.yaml in the same directory, then
+build the flat graph + 4-view projection in system → sip → cube → pe
+order"**.
+
+## Context
+
+`topology/` package responsibilities:
+
+- **builder.py** (1207 lines): turns topology.yaml into a
+  `TopologyGraph` (nodes + edges + 4 view projections).
+- **mesh_gen.py** (305 lines): auto-decides the cube NoC's router
+  grid and PE/UCIe/M_CPU/SRAM attachment positions and caches them in
+  `cube_mesh.yaml`.
+- **visualizer.py** (887 lines): generates four SVG diagrams (system /
+  sip / cube / pe) from a `TopologyGraph`.
+
+ADR-0006 makes the high-level decision that "the result of topology
+compilation is the single source for distance metadata and diagram
+generation", but specific algorithms (e.g., placement-driven nearest-
+router attachment, the HBM exclusion zone, which fields in source_hash
+trigger regeneration) are not in any ADR.
+
+In particular, these decisions are absent at ADR level:
+
+- Why is mesh_gen cached in a separate file (`cube_mesh.yaml`)?
+- Which fields are in source_hash, and which changes force
+  regeneration?
+- Why placement coordinates in mm rather than cube coordinates?
+- How are the HBM exclusion zone and UCIe N/S/E/W distribution
+  decided inside the mesh?
+- What is the abstraction-level difference among the four view
+  projections (system/sip/cube/pe)?
+
+This ADR captures these decisions in one place.
+
+## Decision
+
+### D1. Compile pipeline — six stages
+
+`_compile_graph(spec)`:
+
+1. **System nodes** (`_instantiate_system`): add system-level nodes
+   like `fabric.switch0` and the host CPU.
+2. **Per-SIP loop** (`for sip_id in range(system.sips.count)`):
+   - **IO chiplets** (`_instantiate_io_chiplets`): create pcie_ep /
+     io_cpu / io_noc / io_ucie PHYs / conn nodes and their bidirectional
+     internal edges.
+   - **Cube instantiation** (`_instantiate_cube`): using
+     cube_mesh.yaml's router grid, instantiate cube routers, PE
+     sub-components (pe_cpu, pe_dma, pe_fetch_store, pe_gemm, pe_math,
+     pe_mmu, pe_tcm, pe_scheduler, pe_ipcq), m_cpu, sram, hbm_ctrl,
+     and their internal edges.
+   - **Inter-cube edges** (`_add_inter_cube_edges`): the UCIe
+     N/S/E/W mesh edges.
+   - **IO ↔ cube edges** (`_add_io_to_cube_edges`): connect io_noc to
+     each cube's edge UCIe phy.
+3. **Switch ↔ IO edges** (`_add_system_to_io_edges`): bidirectional
+   edges between `fabric.switch0` and each SIP's `pcie_ep` (the
+   cross-SIP IPCQ path of ADR-0038 D3 + ADR-0010).
+4. **Build four view projections**:
+   - `_build_system_view(spec)` — Tray level: SIPs and the system
+     switch.
+   - `_build_sip_view(spec)` — inside one SIP: cube mesh + IO
+     chiplet.
+   - `_build_cube_view(spec)` — inside one cube: router grid + PE /
+     M_CPU / SRAM / HBM_CTRL attachments.
+   - `_build_pe_view(spec)` — inside one PE: nine sub-components +
+     internal edges (pe_internal kind).
+5. **Return `TopologyGraph`**: `TopologyGraph(spec, nodes, edges,
+   system_view, sip_view, cube_view, pe_view)`.
+
+The six graph-building stages (step 1, the four sub-steps of step 2,
+and step 3) are **ordered for a reason**: only after cubes exist
+do inter-cube edges have valid src/dst, and IO chiplets must precede
+the IO ↔ cube edges that reference them. New node types must slot in
+the right spot.
+
+### D2. `cube_mesh.yaml` — a separate file with a source_hash cache
+
+`mesh_gen.ensure_mesh_file(cube_spec, mesh_path)`:
+
+1. Compute `source_hash = _compute_source_hash(cube_spec)` from these
+   input fields:
+   - `geometry` (cube_mm.w/h …).
+   - `pe_layout` (corners, pe_per_corner).
+   - `ucie.n_connections`.
+   - `memory_map.hbm_mapping_mode`.
+   - `placement` (m_cpu/sram pos_mm).
+2. If `mesh_path` (= `cube_mesh.yaml` next to topology.yaml) exists
+   and `existing.source_hash == source_hash`, reuse it (cache hit).
+3. Otherwise, generate a new mesh via
+   `_generate_mesh(cube_spec, source_hash)` and write to yaml.
+
+The mesh is cached as a separate file because:
+
+- Mesh generation involves nontrivial PE/UCIe/router attachment math
+  and is too expensive to redo every time.
+- Multiple runs with the same cube spec must guarantee an identical
+  mesh.
+- The resulting mesh is itself an inspectable / debuggable artifact.
+
+The five fields listed in source_hash are the ones that determine
+mesh shape; other changes (e.g., bandwidth, overhead_ns) do not
+trigger mesh regeneration.
+
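One way such a hash could be computed is sketched below. The ADR does not state how `_compute_source_hash` serializes or hashes its inputs, so the JSON plus SHA-256 scheme, the `compute_source_hash` and `_dig` helpers, and all spec values (including the `bw_gbs` key under `ucie`) are assumptions for illustration. Only the five field names come from the list above.

```python
import copy
import hashlib
import json

# The five mesh-shaping inputs listed in D2, as key paths into cube_spec.
HASHED_FIELDS = (
    ("geometry",),
    ("pe_layout",),
    ("ucie", "n_connections"),
    ("memory_map", "hbm_mapping_mode"),
    ("placement",),
)


def _dig(spec, path):
    for key in path:
        spec = spec.get(key, {}) if isinstance(spec, dict) else {}
    return spec


def compute_source_hash(cube_spec):
    picked = {".".join(p): _dig(cube_spec, p) for p in HASHED_FIELDS}
    blob = json.dumps(picked, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


base = {
    "geometry": {"cube_mm": {"w": 12.0, "h": 12.0}},
    "pe_layout": {"corners": ["NW", "NE", "SW", "SE"], "pe_per_corner": 2},
    "ucie": {"n_connections": 4, "bw_gbs": 64},
    "memory_map": {"hbm_mapping_mode": "interleaved"},
    "placement": {"m_cpu": {"pos_mm": [1.5, 5.5]},
                  "sram": {"pos_mm": [1.5, 8.5]}},
}
faster = copy.deepcopy(base)
faster["ucie"]["bw_gbs"] = 128               # bandwidth only: cache hit
wider = copy.deepcopy(base)
wider["geometry"]["cube_mm"]["w"] = 14.0     # geometry: regeneration

print(compute_source_hash(base) == compute_source_hash(faster))  # True
print(compute_source_hash(base) == compute_source_hash(wider))   # False
```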
+### D3. Cube NoC mesh auto-layout
+
+`_generate_mesh(cube_spec, source_hash)`:
+
+#### D3.1. Rows / columns
+
+- `pe_positions = _corner_pe_positions(cube_w, cube_h)`: PE-center
+  coordinates (mm) per corner (NW/NE/SW/SE). Hardcoded patterns like
+  `(1.5, 1.5)` and `(cube_w-1.5, cube_h-1.5)`; with `pe_per_corner=2`,
+  each corner has two PE positions.
+- `col_xs = _compute_col_positions(...)`: union of PE x-coordinates,
+  plus relay columns inserted when any gap exceeds
+  `max_spacing = 3.0 mm`.
+- `row_ys, rows_per_half = _compute_row_positions(cube_h,
+  n_connections, pe_positions)`:
+  - `n_conn = max(n_connections, 2)` (hot-path minimum).
+  - `rows_per_half = ceil(n_conn / 2)`.
+  - Top half + two HBM rows + bottom half. HBM sits at
+    `(cube_h/2 - 1.5, cube_h/2 + 1.5)`. The gap between PE rows and
+    HBM rows is `hbm_gap = 1.5 mm`.
+
+#### D3.2. HBM exclusion zone
+
+`hbm_row_start = rows_per_half`,
+`hbm_row_end = rows_per_half + 1`.
+`hbm_col_start = n_cols // 2 - 1`,
+`hbm_col_end = n_cols // 2`.
+
+Router slots inside this (row, col) rectangle are marked `None` (no
+router). HBM controllers are added separately as
+`hbm_ctrl.pe{X}` nodes following ADR-0017 D9's per-PE partition
+pattern.
+
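The index arithmetic of D3.1 and D3.2 can be sketched as follows. The helper name `router_grid`, the `r{row}c{col}` labels, and the assumption that the grid has exactly `2 * rows_per_half + 2` rows (top half, two HBM rows, bottom half) are illustrative and not taken from mesh_gen.py.

```python
import math


def router_grid(n_connections, n_cols):
    n_conn = max(n_connections, 2)
    rows_per_half = math.ceil(n_conn / 2)
    n_rows = 2 * rows_per_half + 2  # top half + two HBM rows + bottom half

    hbm_rows = range(rows_per_half, rows_per_half + 2)   # row_start..row_end
    hbm_cols = range(n_cols // 2 - 1, n_cols // 2 + 1)   # col_start..col_end

    # Slots inside the HBM rectangle hold no router and are marked None.
    return [[None if (r in hbm_rows and c in hbm_cols) else f"r{r}c{c}"
             for c in range(n_cols)]
            for r in range(n_rows)]


grid = router_grid(n_connections=4, n_cols=4)
for row in grid:
    print(" ".join("----" if slot is None else slot for slot in row))
```

With four connections and four columns this yields a 6×4 grid whose four central slots (rows 2 and 3, columns 1 and 2) are empty.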
+#### D3.3. PE attachment
+
+Each corner's PEs map to a row:
+
+- Top half: NW → row 0, NE → row 1 (top_corners index).
+- Bottom half: SW → row `hbm_row_end + 1`, SE → row
+  `hbm_row_end + 2`.
+
+Each PE's x-coordinate attaches to the nearest column's router
+(`min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))`).
+Attachment items are `pe{pe_idx}.dma`, `pe{pe_idx}.cpu`,
+`pe{pe_idx}.hbm` (pushed into the router's attach list).
+
+#### D3.4. M_CPU / SRAM attachment — nearest router by Euclidean distance
+
+For `placement.m_cpu.pos_mm` (default `[1.5, 5.5]`) and
+`placement.sram.pos_mm` (default `[1.5, 8.5]`), find the router with
+the smallest Euclidean distance and append `"m_cpu"` / `"sram"` to
+its attach list.
+
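Both attachment rules (D3.3 and D3.4) reduce to a `min` over candidate routers. In the sketch below the column and row positions and the excluded slots are hypothetical values, and skipping router-less slots during the Euclidean search is an assumption. The `min(range(n_cols), ...)` expression and the default `pos_mm` values are the ones quoted above.

```python
import math

col_xs = [1.5, 4.5, 7.5, 10.5]   # hypothetical column x positions (mm)
row_ys = [1.5, 4.5, 7.5, 10.5]   # hypothetical row y positions (mm)
n_cols = len(col_xs)

# D3.3: a PE attaches to the router in the column nearest its x-coordinate.
pe_x = 3.2
pe_col = min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))

# D3.4: M_CPU / SRAM attach to the router at the smallest Euclidean distance.
excluded = {(1, 1), (1, 2), (2, 1), (2, 2)}  # slots without a router


def nearest_router(pos_mm):
    slots = [(r, c) for r in range(len(row_ys)) for c in range(n_cols)
             if (r, c) not in excluded]
    return min(slots,
               key=lambda rc: math.dist(pos_mm, (col_xs[rc[1]], row_ys[rc[0]])))


print(pe_col)                        # 1
print(nearest_router([1.5, 5.5]))    # (1, 0)  default m_cpu position
print(nearest_router([1.5, 8.5]))    # (2, 0)  default sram position
```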
+#### D3.5. UCIe N/S/E/W distribution
+
+`ucie_pe_rows = top_pe_rows + bot_pe_rows` (total
+`2 * rows_per_half`).
+
+- UCIe-E: one PE row at a time, attach `ucie_e.c{i}` to the rightmost
+  column's router.
+- UCIe-W: attach `ucie_w.c{i}` to the leftmost column's router (E's
+  mirror).
+- UCIe-N/S: split PE columns into left and right halves; attach to
+  the top row's / bottom row's matching columns.
+
+Each UCIe connection is suffixed `c{i}`, distributing
+ucie_n_connections PHYs (ADR-0017 D5+).
+
+### D4. Node naming convention — single ownership
+
+builder.py creates nodes with the following naming convention (the
+single-owner principle from ADR-0051 D5):
+
+- `fabric.switch0` — system-level switch.
+- `sip{S}.{io_id}.{pcie_ep|io_cpu|io_noc|io_ucie.{dir}|conn.{id}}` —
+  IO chiplet.
+- `sip{S}.cube{C}.{m_cpu|sram|hbm_ctrl.pe{X}|noc.r{R}c{C}|...}` —
+  inside cube.
+- `sip{S}.cube{C}.pe{P}.{pe_cpu|pe_dma|pe_fetch_store|pe_gemm|pe_math|pe_mmu|pe_tcm|pe_scheduler|pe_ipcq}` —
+  PE sub-components.
+
+Changing this convention requires updating both builder.py and
+router.py's helpers (ADR-0051). Components never know the convention
+directly — they only call the helpers.
+
+### D5. Edge `kind` classification
+
+Every edge gets a `kind`; routing policy (ADR-0051 D2) reads it. Major
+kinds:
+
+- `"pe_internal"` — within a PE between sub-components.
+- `"pe_to_router"` — PE_DMA ↔ cube NoC router.
+- `"router_mesh"` — between cube NoC routers.
+- `"router_to_hbm"`, `"router_to_mcpu"`, `"router_to_sram"`,
+  `"sram_to_router"`, etc. — between cube-attached components.
+- `"ucie_internal"`, `"ucie_conn_to_router"`,
+  `"router_to_ucie_conn"`, `"ucie_conn_to_noc"`,
+  `"noc_to_ucie_conn"`, `"ucie_mesh"` — UCIe-related.
+- `"io_internal"` — inside IO chiplet.
+- `"io_to_cube"`, `"cube_to_io"` — at the IO ↔ cube boundary.
+- `"pcie"` — switch ↔ pcie_ep.
+- `"command"` — control-plane edges only (e.g., M_CPU ↔ NOC; excluded
+  from PE DMA paths).
+
+Adding a new edge kind requires picking a category in router.py's
+four adjacency graphs (ADR-0051 D2). If you forget, it defaults to
+`_adj_all` only, which can produce unintended routes.
+
+### D6. View projection — four abstraction levels
+
+`TopologyGraph` keeps four view projections alongside the flat
+nodes+edges:
+
+- **system_view** (`_build_system_view`): Tray level. SIP blocks and
+  `fabric.switch0`. PCIe links shown. For external high-level
+  overview.
+- **sip_view** (`_build_sip_view`): inside one SIP — cube mesh + IO
+  chiplet (pcie_ep + io_cpu + io_noc). UCIe N/S/E/W appear as
+  cube-cube links.
+- **cube_view** (`_build_cube_view`): inside one cube — router grid +
+  PE / M_CPU / SRAM / HBM_CTRL attachments + UCIe PHY edges. For
+  intra-cube routing / placement debugging.
+- **pe_view** (`_build_pe_view`): inside one PE — nine sub-components
+  + internal edges (pe_internal kind). For detailed PE-internal
+  dataflow review.
+
+Views are selectively rendered via the spec's
+`visualization.emit_views: [system, sip, cube]` (ADR-0006). The pe
+view is omitted from default output but the code is retained for
+detailed debugging.
+
+### D7. visualizer.py — SVG diagram output
+
+`emit_diagrams(graph, out_dir)` renders every view as SVG. Key
+functions:
+
+- `_render_view_svg(view)` — generic view render (no router grid).
+- `_render_cube_view_svg(view, spec)` — cube-view specific (HBM block,
+  router grid layout, PE/M_CPU/SRAM/HBM placement).
+- `_draw_node`, `_draw_edge` — node/edge visual representation.
+- `_pick_scale`, `_compute_node_sizes` — auto-scaling.
+
+The visualizer's output is a **derived artifact** (ADR-0006), so
+changes here are not subject to production checks. Aligns with
+CLAUDE.md's "Derived Artifacts" guidance.
+
+### D8. Blast radius of spec changes
+
+| spec field                             | effect                  | mesh regenerated? |
+|----------------------------------------|-------------------------|-------------------|
+| `system.sips.count`                    | SIP count, node count   | No                |
+| `sip.cube_mesh.w/h`                    | cube mesh shape         | No                |
+| `cube.geometry.cube_mm.w/h`            | cube size (mm)          | **Yes**           |
+| `cube.pe_layout.corners/pe_per_corner` | PE attachment positions | **Yes**           |
+| `cube.ucie.n_connections`              | UCIe PHY distribution   | **Yes**           |
+| `cube.memory_map.hbm_mapping_mode`     | HBM distribution mode   | **Yes**           |
+| `cube.placement`                       | M_CPU/SRAM positions    | **Yes**           |
+| `cube.memory_map.*` (besides above)    | HBM capacity / BW       | No                |
+| `*.links.*.bw_gbs`                     | edge bandwidth          | No                |
+| `*.attrs.overhead_ns`                  | component latency       | No                |
+
+The table mirrors D2's `_compute_source_hash` inputs. Changes that
+require mesh regeneration automatically invalidate `cube_mesh.yaml`'s
+source_hash.
+
+## Alternatives Considered
+
+### A1. Regenerate the mesh on every compile without a cache file
+
+Rejected. The cost of mesh generation would be paid repeatedly (CLI
+runs, probe, tests) for the same spec, and the human-inspectable
+artifact would disappear.
+
+### A2. Merge mesh generation into builder.py
+
+Rejected (currently). It is a 305-line algorithm of its own, and the
+mesh-layout decisions (placement-driven router attachment, HBM
+exclusion zone) are different from builder's general node/edge
+emission. Keeping it separate respects single-responsibility.
+
+### A3. Express placement coordinates in cube coordinates (col/row)
+
+Rejected. mm coordinates flow consistently between the visualizer and
+mesh layout (for nearest-router computation). Cube coordinates are
+undefined until the router grid is fixed, so they are unsuitable as
+placement input.
+
+### A4. Lazy view projection generation
+
+Rejected (currently). The four views are cheap to build (typically <
+100 ms), and eager construction guarantees `TopologyGraph` as the
+single source of truth.
+
+### A5. Visualizer output in formats besides SVG (PNG/PDF)
+
+Rejected. SVG is vector + text-searchable + directly renderable in
+browsers. PNG conversion, when required, is downstream
+post-processing (e.g., rsvg-convert).
+
+## Consequences
+
+- ADR-0006's high-level intent is fleshed out via D1–D7; topology
+  changes can be assessed quickly via D8's table.
+- D3's mesh-layout algorithm is ADR-locked, so future PE attachment
+  patterns (e.g., a 6-zone HBM split) make clear which stage they
+  affect.
+- D5's edge-kind list and D6's view structure are explicit, giving PR
+  reviewers a quick map of where (builder + router + visualizer) a
+  new component type ripples through.
+- D2's source_hash invalidation rules are explicit, so a
+  `cube_mesh.yaml` that is not regenerated (e.g., when only bandwidth
+  changed) is recognized as correct behavior rather than a stale cache.
@@ -0,0 +1,143 @@
+# ADR-0054: Milestone Eval Benches — self-contained sweep + figure benches
+
+## Status
+
+Accepted (2026-05-22).
+
+Amends ADR-0044 (D1/D2) and ADR-0045 (D5) and supersedes the "logic lives
+in `scripts/` + `tests/`" arrangement of ADR-0043/0044: the GEMM and
+allreduce evaluation harnesses are now self-contained **benches** that a
+user runs to regenerate every result + figure.
+
+## Context
+
+ADR-0043 (allreduce eval) and ADR-0044 (GEMM eval) split each harness into
+a **sweep** (a manual `scripts/` driver, or — for allreduce — the
+parametrized tests themselves) plus **figure tests** that render committed
+data. The sweep/render logic therefore lived under `scripts/gemm_sweep.py`,
+`tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.
+
+A milestone requirement ("refactor allreduce + GEMM evaluation so a user
+can run *one bench* to generate all the results and plots") cannot be met
+by that layout: a bench is production code and **must not import from
+`tests/`** (ADR-0007 layer direction). The eval logic had to move into
+production, reachable from a bench.
+
+The chosen home is the bench module itself — not a separate
+`kernbench.eval` package. A bench file may contain arbitrary module-level
+code; collapsing the harness into the bench keeps one file per domain and
+avoids an extra package layer.
+
+## Decision
+
+### D1. Two milestone benches own the eval logic
+
+- `src/kernbench/benches/milestone_1h_gemm.py` — GEMM shape×variant sweep +
+  the three figure renderers (moved from `scripts/gemm_sweep.py` +
+  `tests/gemm/_gemm_plot_helpers.py`).
+- `src/kernbench/benches/milestone_1h_ccl.py` — the distributed allreduce
+  driver, latency + buffer-kind sweeps, topology diagram, FSIM comparison,
+  and the direct-launch parity reference (moved from
+  `tests/sccl/_allreduce_helpers.py`).
+
+Each file is the **single home** for its domain's eval logic.
+
+### D2. The "eval bench" pattern (extends ADR-0045 D5)
+
+ADR-0045 D5 fixed a bench to a single configuration (single-SIP, or the
+ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:
+
+- An **eval bench** may drive *many* configurations and render figures. It
+  builds its own per-config `GraphEngine` / `RuntimeContext` instances
+  (one per sweep point) rather than using the outer `run_bench` engine.
+- Because the outer ctx then has no submitted handles, the bench submits a
+  **sentinel tensor** (`torch.zeros((1, 1), …)`) at the end to satisfy
+  `run_bench`'s "must submit at least one request" contract (ADR-0045 D4),
+  so the CLI exits 0.
+
+### D3. Output location
+
+Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
+(per user request — artifacts beside the bench). The directory holds only
+generated PNG/CSV/JSON (never a `.py`/`__init__.py`), so the eager-import
+audit (ADR-0045 first action) ignores it — `pkgutil.iter_modules` does not
+yield non-package subdirectories. It is **committed** (like the
+`docs/diagrams/` artifacts) so the figures are viewable on the remote;
+rerunning the bench regenerates it in place.
+
+### D4. GEMM heavy sweep — fresh by default, `MILESTONE_FAST` to reuse
+
+`milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one
+shape is 2048 tiles). `MILESTONE_FAST=1` reuses the committed
+`docs/diagrams/gemm_sweep.json` and only re-renders (seconds). This
+reverses ADR-0044 D1/D2's "heavy sweep stays a manual/`slow`-marked step":
+running the bench *is* the regeneration. In the test suite, the slow path
+is exercised by a `@pytest.mark.slow` bench test; the default (unmarked)
+tests take the fast path.
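+
+The fresh-versus-reuse switch in D4 can be sketched as follows. This is a
+minimal illustration, not the bench's actual code: `load_or_run_sweep`,
+its parameters, and the fall-through to a fresh sweep when the JSON file
+is missing are assumptions; only the `MILESTONE_FAST=1` variable comes
+from this ADR.
+
+```python
+import json
+import os
+from pathlib import Path
+
+
+def load_or_run_sweep(json_path, sweep_fn, env=None):
+    """Return (results, mode); reuse committed JSON only if MILESTONE_FAST=1."""
+    env = os.environ if env is None else env
+    path = Path(json_path)
+    if env.get("MILESTONE_FAST") == "1" and path.exists():
+        # Fast path: reuse the committed data so only rendering remains.
+        return json.loads(path.read_text()), "reused"
+    # Default path: run the full sweep and refresh the JSON in place.
+    # (Assumption: a missing file also falls back to a fresh sweep.)
+    results = sweep_fn()
+    path.write_text(json.dumps(results))
+    return results, "fresh"
+```
+
+Passing `env` explicitly keeps the helper testable without mutating the
+process environment.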
+
+### D5. Tests + script reuse via thin re-export shims (single home kept)
+
+The pre-existing figure tests and the `scripts/gemm_sweep.py` entry point
+are retained and now reuse the bench modules:
+
+- `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
+  `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
+  `kernbench.benches.milestone_1h_gemm`.
+- `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
+  writers, sweep constants, renderers, and disk aggregators from
+  `kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
+  local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
+  `_bk_params`) and the fixture-coupled `_run_distributed`
+  (`monkeypatch.chdir` + `_drive_distributed`) wrapper.
+- `scripts/gemm_sweep.py` → thin wrapper over the bench's `run_sweep`.
+
+Tests importing a bench module is permitted (tests sit above production,
+ADR-0007); it triggers the whole-package eager audit, which already runs on
+every `kernbench` invocation. matplotlib stays lazily imported inside the
+renderers, so the audit's startup cost is unchanged.
+
+### D6. Flat module naming (no `benches/` subfolder)
+
+A `benches/` subpackage named `1H_milestone…` is not workable: a name
+used in an `import` statement must be a valid identifier, and identifiers
+cannot start with a digit. The benches are therefore flat
+modules `milestone_1h_gemm.py` / `milestone_1h_ccl.py` with bench names
+`milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, letter-first per
+ADR-0045 D1).
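+
+The naming constraint is easy to check with the standard library. The
+helper names below (`is_importable_name`, `bench_name`) are hypothetical
+and exist only for this sketch; the underscore-to-hyphen mapping is an
+assumption that matches the two names listed above.
+
+```python
+import keyword
+
+
+def is_importable_name(name):
+    """True if `name` can appear in a plain `import` statement."""
+    return name.isidentifier() and not keyword.iskeyword(name)
+
+
+def bench_name(module_name):
+    """Hypothetical mapping from a module name to a kebab-case bench name."""
+    return module_name.replace("_", "-")
+```
+
+Under this check a name such as `1H_milestone` is rejected because it
+starts with a digit, while `milestone_1h_gemm` is accepted.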
+
+## Consequences
+
+### Positive
+
+- `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of
+  a domain's results + figures in one command — the milestone requirement.
+- Single source for the eval logic (the bench), reused by tests and the
+  script via shims; no duplication.
+- The figure tests and `scripts/gemm_sweep.py` keep working unchanged.
+
+### Negative / limitations
+
+- The two bench files are large (the CCL one mixes the distributed driver,
+  sweeps, and matplotlib drawing). A "bench" that is mostly an eval harness
+  is unusual; this ADR legitimizes it.
+- Generated artifacts live inside the source tree (`src/kernbench/benches/`)
+  by explicit request and are committed (so the figures are viewable on the
+  remote); rerunning the bench regenerates them.
+- `milestone-1h-ccl` (and the default `milestone-1h-gemm`) take minutes —
+  acceptable for an on-demand milestone artifact, not for routine runs.
+
+## Dependencies
+
+- **ADR-0007**: layer direction (why tests may import production but a bench
+  may not import tests).
+- **ADR-0043 / ADR-0044**: the allreduce / GEMM eval harnesses this ADR
+  relocates into benches.
+- **ADR-0045**: bench module contract; D2 here extends its D5 (single-device
+  rule) with the eval-bench pattern, and relies on D4 (NO_REQUESTS) for the
+  sentinel.
+- **ADR-0024**: rank = SIP launcher driven by the allreduce sweeps.
+
+## Open questions
+
+- Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from
+  ADR-0033/0014 rather than copied? Unchanged by this ADR.
+- Should `build_overview_slides.py` consume the milestone output PNGs
+  instead of drawing GEMM bars natively? Still open (ADR-0044 D6 / Negative).
@@ -0,0 +1,175 @@
+# ADR Index
+
+Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **47**.
+
+Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.
+
+## Design Principles
+
+- [ADR-0013](./ADR-0013-ver-verification-strategy.md) — Verification Strategy and Phase 1 Test Plan
+- [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — Latency Model: Assumptions and Known Simplifications
+
+## High-level Architecture
+
+- [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — Target System Hierarchy & Modeling Scope  _(System hierarchy (Tray / SIP / CUBE / PE))_
+- [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — Runtime API and Simulation Engine Boundaries  _(Runtime API ↔ sim_engine boundaries)_
+- [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NOC and Memory Data Path  _(IOChiplet NOC and memory data path)_
+- [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — Cube NOC and HBM Connectivity  _(Cube NOC and HBM connectivity)_
+
+## Detailed Architecture
+
+One subsection per component file under `src/kernbench/components/builtin/`.
+
+### forwarding
+
+- [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding Component (forwarding_v1)
+
+### hbm_ctrl
+
+- [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM Controller Internal Design
+
+### io_cpu
+
+- [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU Component Model
+
+### m_cpu
+
+- [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU and M_CPU.DMA Component Model
+
+### pcie_ep
+
+- [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model
+
+### pe_cpu
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+
+### pe_dma
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
+
+### pe_fetch_store
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+
+### pe_gemm
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+
+### pe_ipcq
+
+- [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
+
+### pe_math
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+
+### pe_mmu
+
+- [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — Component + Utility Dual Role
+
+### pe_scheduler
+
+- [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
+
+### pe_tcm
+
+- [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — Dual-Channel BW Serialization
+
+### sram
+
+- [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC
+
+### tiling
+
+- [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math Pipeline Plan Builders
+
+## Implementation Decisions
+
+### Address Scheme
+
+- [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51-bit Physical Address Layout & Decoding Contract
+- [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — Memory Addressing — PA / VA / LA Address Models
+
+### Routing & Helper API
+
+- [ADR-0002](./ADR-0002-lat-routing-distance.md) — Routing Distance, Ordering & Bypass Rules
+- [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`
+
+### Memory Semantics & Local-HBM Bandwidth
+
+- [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — Memory Semantics & Local-HBM Bandwidth Guarantee
+
+### Topology Compilation, Diagrams & Builder Algorithms
+
+- [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — Diagram Views & Distance-Aware Layout Rules
+- [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — Topology Compilation, Distance Extraction, and Automatic Diagram Generation
+- [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms
+
+### Tensor Deployment and Allocation
+
+- [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — Tensor Deployment and Allocation (Host Allocator, PA-first)
+
+### Kernel Execution and Host-Device Messaging
+
+- [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — Kernel Execution Messaging and Completion Semantics
+- [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
+
+### CLI Surface and Semantics
+
+- [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — Command Line Interface and Execution Semantics
+
+### Component Port/Wire Fabric Model
+
+- [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — Component Port/Wire Model and Fabric Routing
+
+### Two-Pass Data Execution
+
+- [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass Data Execution Model (Timing / Data Separation)
+
+### 2D Grid Program Identity
+
+- [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D Grid program_id Semantics
+
+### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
+
+- [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
+- [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — remove sip/num_sips fields
+- [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
+- [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
+- [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`
+
+### IPCQ Direction Addressing
+
+- [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching
+
+### Intercube All-Reduce
+
+- [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
+
+### Evaluation Harnesses
+
+- [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce Evaluation Harness — `tests/sccl/`
+- [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`
+- [ADR-0054](./ADR-0054-eval-milestone-benches.md) — Milestone Eval Benches — self-contained sweep + figure benches
+
+### Bench Module Contract
+
+- [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring
+
+### Kernel-side tl.* API (TLContext)
+
+- [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract
+
+### Memory Allocator Algorithms
+
+- [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
+
+### Probe Subcommand
+
+- [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
+
+### Sim-engine Op Log and Memory Store Schemas
+
+- [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -0,0 +1,836 @@
+# KernBench — Architecture Design Document
+*2026 1H*
+
+KernBench is a system-level, discrete-event simulator for AI-accelerator
+chiplet systems. It models the data-movement and control paths across
+the full hardware hierarchy and reports end-to-end execution latency
+for kernels dispatched to the device's compute units.
+
+This document is a public summary of the architecture as designed and
+implemented in the first half of 2026. It assumes no prior knowledge of
+the simulator's internal documents; terms specific to the system are
+defined on first use.
+
+---
+
+## Design Principles
+
+KernBench is grounded in two foundational commitments: every measured
+latency must trace to explicit, modeled events on the simulator's graph,
+and every behavioral claim must be verifiable through tests that target
+spec-level invariants rather than incidental implementation details.
+
+<!-- src: ADR-0013 Context, Decision -->
+Development is verification-driven. Tests are written to
+validate the architectural contracts that the simulator exposes —
+correct routing, deterministic results, monotonic latency under
+increasing hop counts — rather than to mirror the call graph of the
+implementation. Two phases coexist: a fast timing phase that exercises
+the simulator's discrete-event engine and produces a log of operations
+with timestamps, and an optional data-replay phase that uses that log
+to compute real numerical results. Tests can target either phase.
+
+<!-- src: ADR-0033 Context, Decision -->
+The latency model is intentionally abstract rather than
+cycle-accurate. Each modeled node contributes a configurable per-node
+overhead, each link contributes wire delay plus byte-over-bandwidth
+serialization, and each terminal service contributes its own service
+time. The simulator does not attempt to reproduce cache coherence
+protocols, microarchitectural pipelines, or full PCIe/UCIe protocol
+correctness; those are explicitly outside the scope. The aim is a
+simulator that compares system-level configurations meaningfully and
+deterministically, not one that ships microarchitectural truths.
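+
+The additive model described above can be sketched in a few lines. This
+is an illustration, not the simulator's code: the `Hop` dataclass, the
+`path_latency_ns` function, the units, and the choice to charge
+serialization on every hop are assumptions made for the example.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class Hop:
+    node_overhead_ns: float        # per-node processing overhead
+    wire_delay_ns: float           # fixed delay of the outgoing link
+    bandwidth_bytes_per_ns: float  # link bandwidth (1 GB/s == 1 byte/ns)
+
+
+def path_latency_ns(hops, nbytes, service_time_ns):
+    """Sum node overhead, wire delay, and serialization, plus terminal service."""
+    total = service_time_ns
+    for hop in hops:
+        serialization_ns = nbytes / hop.bandwidth_bytes_per_ns
+        total += hop.node_overhead_ns + hop.wire_delay_ns + serialization_ns
+    return total
+```
+
+Every term is an explicit input, so two runs with the same hops, size,
+and service time return the same value.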
+
+<!-- src: ADR-0033 Decision, Consequences -->
+Determinism is a hard requirement. Given identical inputs — topology,
+routing policy, and request stream — the simulator must produce
+identical outputs, hop traces included. This rules out reliance on
+unordered set iteration on the critical path and forces every latency
+contribution to come from an explicitly scheduled event on a modeled
+component or link. There are no implicit waits, no hardcoded magic
+delays, and no shortcuts that bypass the modeled graph.
+
+---
+
+## High-level Architecture
+
+<!-- src: ADR-0003 Context, Decision -->
+The simulated system is a four-level hierarchy. A **Tray** holds one or
+more **SIPs** (system-in-package), each containing a 2D mesh of
+**CUBEs** plus one or more **IO chiplets** that connect the SIP to the
+host. Each CUBE contains a regular grid of **PEs** (processing
+elements) plus its own attached resources — high-bandwidth memory
+(HBM), a per-cube SRAM scratchpad, and a management CPU (M_CPU). The PE
+itself is a composite of nine sub-components rather than a monolithic
+core. This hierarchy is fixed; the parameters along each axis (counts,
+mesh dimensions, link widths) are configurable through the topology
+spec.
+
+<!-- src: ADR-0007 Context, Decision -->
+A clean separation runs along the request flow. A **runtime API** at
+the top is the host-facing surface; it exposes tensor and kernel
+operations, owns host-side allocation metadata, and is topology-
+agnostic — it does not route or fan out. Below it the **simulation
+engine** decomposes runtime operations into discrete graph requests
+(memory writes, memory reads, kernel launches, MMU map installs) and
+schedules events deterministically. At the bottom, **components** model
+device behavior on a graph of nodes connected by links; they
+implement the actual latency contributions and pass requests along.
+No component reaches up into the runtime API, and no runtime call
+shortcuts the engine.
+
+<!-- DIAGRAM: Four-level system hierarchy — Tray containing SIPs, each SIP showing its 2D cube mesh and IO chiplet; one cube blown up to show the router mesh, attached PEs, M_CPU, SRAM, and HBM partition. -->
+
+### Tray
+
+<!-- src: ADR-0003 Decision -->
+The Tray is the outermost boundary. It owns the host CPU on one side
+and one or more SIPs on the other, connected through a fabric switch.
+For collective communication that must traverse multiple SIPs, the
+fabric switch acts as the common rendezvous: device-side outbound
+traffic from one SIP routes through the switch and back into the
+target SIP's IO chiplet.
+
+### SIP
+
+<!-- src: ADR-0003 Decision, ADR-0017 Context -->
+A SIP packages a 2D mesh of CUBEs and one or more IO chiplets. The
+default topology used by the simulator is a 4×4 cube mesh; the
+mesh dimensions are configurable. Each cube on the boundary of the
+mesh connects to its neighbors over UCIe (die-to-die) links arranged
+on the four cardinal sides — north, south, east, and west. The IO
+chiplets sit on one side of the SIP and provide the bridge to the host
+across PCIe.
+
+<!-- src: ADR-0016 Context, Decision -->
+The IO chiplet itself contains its own internal network. A
+host-facing PCIe endpoint passes traffic to a small NOC ("network on
+chip"); from there it can branch to a control-plane CPU that processes
+kernel-launch messages, or it can take the direct memory data path to
+the cube's HBM controller. The decision to provide a direct memory
+path that bypasses the control CPU was a deliberate concession to
+keep host-issued memory writes from paying control-plane overhead on
+the data path.
+
+### CUBE
+
+<!-- src: ADR-0017 Decision -->
+Each CUBE owns a 2D mesh of NOC routers and a set of attached
+resources: PEs, the cube-local SRAM scratchpad, the management CPU
+(M_CPU), and the HBM partition (split across multiple PE-private
+slices for bandwidth). The router mesh uses deterministic XY routing.
+Attached components do not connect to each other directly — they all
+sit on the router mesh, and every cube-internal transfer pays the
+mesh distance from source to destination.
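+
+A sketch of deterministic XY routing on a router grid follows. It assumes
+the conventional reading of "XY" (travel along X first, then along Y) and
+uses a hypothetical `xy_route` helper with `(x, y)` router coordinates;
+the simulator's own routing code may differ.
+
+```python
+def xy_route(src, dst):
+    """Return the router coordinates visited from src to dst, X first."""
+    x, y = src
+    path = [(x, y)]
+    step_x = 1 if dst[0] > x else -1
+    while x != dst[0]:          # resolve the X offset first
+        x += step_x
+        path.append((x, y))
+    step_y = 1 if dst[1] > y else -1
+    while y != dst[1]:          # then resolve the Y offset
+        y += step_y
+        path.append((x, y))
+    return path
+```
+
+The hop count is `len(path) - 1`, which equals the Manhattan distance
+between the two routers.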
+
+<!-- src: ADR-0017 Decision -->
+The HBM partition is per-PE: each PE owns one HBM slice, and the
+controller exposes per-PE channels so that the same PE always
+addresses the same set of HBM channels. This makes the local-HBM
+bandwidth from a PE to its own slice predictable, while accesses to
+another PE's slice — or a different cube's slice — pay the mesh
+distance and any UCIe crossings.
+
+### PE
+
+<!-- src: ADR-0014 Context, Decision -->
+A PE is not a monolithic core. Internally it is a set of nine
+sub-components, each modeling one stage of a request's flow: a small
+control CPU, a tile-pipeline scheduler, a DMA engine, a fetch-store
+engine that moves data between the on-PE scratchpad and the register
+file, a GEMM compute engine, a math compute engine, the tightly-
+coupled memory (TCM, the on-PE scratchpad), an MMU for virtual-to-
+physical address translation, and an inter-PE collective queue
+(IPCQ). The scheduler decomposes higher-level operations into per-tile
+stage sequences, and tile tokens self-route from one sub-component
+to the next.
+
+<!-- DIAGRAM: PE internal layout — the nine sub-components and the edges that connect them; tile token flowing through DMA_READ → FETCH → GEMM → STORE → DMA_WRITE. -->
+
+---
+
+## Detailed Architecture
+
+This section describes each modeled device-side component in turn.
+Components are listed in the alphabetical order used by the
+simulator's source tree.
+
+### forwarding
+
+<!-- src: ADR-0037 Context, Decision -->
+The forwarding component is the generic routing relay used wherever a
+node only needs to apply a small processing overhead and pass the
+request to the next hop. NOC routers, conn nodes, and UCIe PHYs all
+reduce to this. Its first act on receiving a request is to apply the
+per-node overhead configured for it in the topology spec; after the
+overhead it simply hands the request to the next hop along the path.
+
+<!-- src: ADR-0037 Decision, Consequences -->
+The decision to share one implementation across these roles was made
+to keep the simulator's component set small without sacrificing
+modeling fidelity. Each instance still carries its own overhead and
+its own link bandwidth contributions, so different roles still produce
+different timing. What is shared is the dispatcher loop, not the
+parameter values.
+
+### hbm_ctrl
+
+<!-- src: ADR-0034 Context, Decision -->
+The HBM controller is the terminal node for all memory traffic that
+reaches HBM. Internally it owns a number of pseudo channels, partitioned
+per-PE so that each PE addresses a deterministic subset. On a request
+arrival the controller first selects the right pseudo channel from the
+target address, then enters a chunk-loop that drains the requested
+size in fixed-size flits over the channel's bandwidth.
+
+<!-- src: ADR-0034 Decision, Consequences -->
+The chunk-loop pattern replaces an earlier all-at-once drain. The
+benefit is that the controller no longer presents a flit-aware fabric
+with a single bulk transfer; instead it emits flits at a paced rate
+matching the channel bandwidth, which makes cross-flow contention
+visible. The bandwidth budget is calibrated against the configured
+HBM total bandwidth divided across the channel count.
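+
+The paced drain can be sketched as a loop that emits one flit at a time.
+The function name, parameter names, and units below are assumptions for
+illustration; the real controller schedules these as simulator events.
+
+```python
+def chunk_schedule(size_bytes, flit_bytes, channel_bytes_per_ns, start_ns=0.0):
+    """Return (completion_time_ns, bytes) for each flit, in drain order."""
+    events = []
+    now = start_ns
+    remaining = size_bytes
+    while remaining > 0:
+        chunk = min(flit_bytes, remaining)    # last flit may be partial
+        now += chunk / channel_bytes_per_ns   # pace at channel bandwidth
+        events.append((now, chunk))
+        remaining -= chunk
+    return events
+```
+
+Because each flit has its own completion time, flits from another flow
+can interleave between them, which is what makes contention visible.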
+
+### io_cpu
+
+<!-- src: ADR-0036 Context, Decision -->
+The IO_CPU is the control-plane processor sitting inside the IO chiplet.
+It receives kernel-launch messages from the host, decodes them, and
+dispatches per-cube launches to the cube's management CPU. Pure memory
+operations bypass it entirely, taking the direct data path established
+inside the IO chiplet.
+
+<!-- src: ADR-0036 Decision -->
+On receiving a kernel-launch message, the IO_CPU consults the message's
+shard list — which already names the target SIP, cube, and PE for each
+piece of the tensor argument — and forwards a per-cube launch to each
+cube the kernel needs to reach. This makes the IO_CPU a deterministic
+fan-out point: it does not decode physical addresses to route, it just
+follows the explicit per-shard targets it was handed.
+
+### m_cpu
+
+<!-- src: ADR-0035 Context, Decision -->
+The M_CPU is the cube's management processor. It owns two distinct
+roles: as a control-plane fan-out point for kernel launches arriving
+from the IO chiplet, and as a DMA endpoint for host-initiated memory
+writes that need to land in this cube's HBM. The control role
+forwards launches to the right PE control CPUs; the DMA role places
+the actual bytes into HBM through the router mesh.
+
+<!-- src: ADR-0035 Decision -->
+The component model deliberately distinguishes the two roles because
+their routing differs: the control fan-out path uses command-kind
+links that do not appear on data-path routes, while the DMA path uses
+the same router mesh as PE-initiated DMA, with PE-internal nodes
+excluded. The routing layer knows about both modes and selects the
+appropriate adjacency at request time.
+
+### pcie_ep
+
+<!-- src: ADR-0038 Context, Decision -->
+The PCIe endpoint is the protocol boundary at the host-device edge.
+Its first act on each incoming request is to apply a configured
+protocol-processing overhead; after that it simply forwards. There is
+no internal queuing model, no retry, and no TLP-level fidelity — those
+are deliberately outside scope. The endpoint is bidirectional: host →
+device traffic (memory writes, kernel launches) flows one way, and
+device-side outbound traffic (cross-SIP collective sends) flows the
+other.
+
+<!-- src: ADR-0038 Decision, Alternatives Considered, Consequences -->
+A more detailed PCIe model was considered and rejected. The simulator
+is targeting system-level latency comparisons; making the endpoint
+heavier with credit-management and retry logic would not improve the
+metrics being studied. The decision keeps the endpoint as the
+documented protocol-boundary node, named consistently so routing
+helpers can locate it by SIP and IO instance.
+
+### pe_cpu
+
+<!-- src: ADR-0014 Decision -->
+The PE control CPU is the entry point for kernel work arriving from
+the cube's management CPU. It receives kernel-launch messages, resolves
+the kernel function by name, and hands execution to the scheduler with
+the resolved tensor arguments. From the scheduler's point of view, the
+PE_CPU is the upstream source of high-level commands; from the rest
+of the system's point of view, the PE_CPU is where a kernel's
+execution begins on a given PE.
+
+### pe_dma
+
+<!-- src: ADR-0014 Decision, ADR-0023 Decision -->
+The DMA engine on each PE has two distinct modes. In the standard PE
+pipeline it consumes tile tokens issued by the scheduler, acquires a
+read or write channel (modeled as a one-in-flight resource per
+direction), and runs the bytes to or from HBM through the mesh. In
+its collective mode it forwards send tokens for the cube's IPCQ into
+the fabric, snapshotting the source data at send time so later
+mutations cannot race the receiver's read. Both modes share the same
+channel resources but differ in their downstream handling — one
+returns when the round-trip completes, the other dispatches
+fire-and-forget.
+
+### pe_fetch_store
+
+<!-- src: ADR-0014 Decision -->
+The fetch-store engine is the bridge between the on-PE scratchpad
+(TCM) and the register file. It does not run DMA; it only moves bytes
+internally. On receiving a tile-stage token it sends a short request
+to the TCM, waits for the bandwidth-serialized delay, and continues
+the pipeline. The split between this engine and the TCM lets the
+scratchpad model its own read/write bandwidth independently.
+
+### pe_gemm
+
+<!-- src: ADR-0014 Decision -->
+The GEMM engine is the matrix-multiply compute unit. Tile tokens
+arriving at this stage carry the per-tile dimensions, and the engine
+contributes a service time that accounts for one fused multiply-add per
+MAC in the tile. Composite operations (where the same tensor pair is
+streamed across many tiles) reuse the engine through the scheduler;
+the engine itself is stateless between tiles.
+
+### pe_ipcq
+
+<!-- src: ADR-0023 Context, Decision -->
+The IPCQ — inter-process communication queue — is each PE's
+collective-communication endpoint. It owns ring buffers that hold
+inbound messages from neighbor PEs and bookkeeping for send credits.
+Direction names ("N", "S", "E", "W" for cube-internal neighbors and
+"global_*" for cross-SIP neighbors) are resolved to physical peer
+endpoints by a neighbor table installed at process-group creation
+time. The component itself does not move bytes — it issues DMA tokens
+through the local PE_DMA, which performs the actual cross-PE
+transfer.
+
+<!-- src: ADR-0023 Decision, Consequences -->
+A key invariant is that the inbound terminal — where data lands at
+the receiver — pays the link bandwidth drain plus any cube-internal
+mesh hop to the slot's backing memory. This prevents IPCQ from
+silently outpacing raw DMA at large transfer sizes. Outbound sends
+are fire-and-forget; credit return is the only backpressure signal.
+
+### pe_math
+
+<!-- src: ADR-0014 Decision -->
+The math engine handles element-wise and reduction operations. It
+consumes tile tokens carrying an operation kind (`exp`, `sum`, `max`,
+`where`, etc.) and contributes a service time proportional to the
+number of elements processed. Like the GEMM engine it is stateless;
+chained epilogues (a sequence of math operations after a GEMM tile)
+are scheduled as separate stages.
+
+### pe_mmu
+
+<!-- src: ADR-0039 Context, Decision -->
+The MMU has two roles, exposed through one component. As a node on
+the cube NOC it receives MMU-map and MMU-unmap messages and updates
+its internal page table, so that the runtime API can install
+virtual-to-physical mappings with measured fabric latency. As a
+utility object held inside the PE it offers synchronous translate
+calls to the PE's DMA and GEMM engines without taking simulator time
+itself; the calling engine pays any configured TLB overhead in its
+own process.
+
+<!-- src: ADR-0039 Decision, Alternatives Considered -->
+The page table supports multiple disjoint regions inside a single
+page, with later-write-wins semantics on overlap. This is a deliberate
+simulator stopgap: it lets parallelization policies shard data at
+sub-page granularity, whereas the one-PA-per-entry assumption of a
+real hardware MMU would silently mis-route such accesses. A real MMU
+does not work this way; the model documents this as a simplification.
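+
+A minimal sketch of a region-based table with later-write-wins lookup is
+shown below. The class and method names are hypothetical, and the sketch
+returns `None` for unmapped addresses instead of modeling any fallback
+behavior.
+
+```python
+class RegionPageTable:
+    """Toy page table holding arbitrary (possibly overlapping) regions."""
+
+    def __init__(self):
+        self._regions = []  # (va_start, va_end, pa_start), insertion order
+
+    def map(self, va_start, size, pa_start):
+        self._regions.append((va_start, va_start + size, pa_start))
+
+    def translate(self, va):
+        # Scan newest-first so the most recent mapping wins on overlap.
+        for start, end, pa_start in reversed(self._regions):
+            if start <= va < end:
+                return pa_start + (va - start)
+        return None
+```
+
+Two regions can share one page here, which is exactly the behavior a
+one-PA-per-entry hardware table cannot express.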
+
+### pe_scheduler
+
+<!-- src: ADR-0014 Decision -->
+The scheduler is the sole dispatcher inside a PE. Simple commands are
+routed directly to the right engine. Composite commands generate a
+tile plan, and the resulting tile tokens are fed into the pipeline.
+Self-routing keeps the scheduler off the per-stage hot path: each
+engine, on finishing a stage, advances the token to the next stage's
+component itself, so the scheduler only does initial dispatch and
+completion tracking.
+
+### pe_tcm
+
+<!-- src: ADR-0040 Context, Decision -->
+The TCM is the per-PE tightly-coupled scratchpad memory. It models
+time only, not data — the actual payload lives in the simulator's
+memory store. Read and write are independent channels: each is
+modeled as a one-in-flight resource, so same-direction requests
+serialize but a read and a write can overlap. The bandwidth of each
+direction is configured separately and applied as bytes-over-bandwidth
+on each request.
+
+<!-- src: ADR-0040 Decision, Alternatives Considered -->
+The decision to keep read and write on separate channels was made
+because the PE pipeline's normal case overlaps fetch (read) and store
+(write). Collapsing them into a single shared channel would have
+artificially serialized that overlap and produced an incorrect
+bandwidth ceiling.
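+
+The dual-channel behavior can be sketched with one "free at" timestamp
+per direction. `DualChannelTcm` and its `access` method are hypothetical
+names for this example, and the sketch tracks time only, as the
+component does.
+
+```python
+class DualChannelTcm:
+    """Time-only model: one in-flight request per direction."""
+
+    def __init__(self, read_bytes_per_ns, write_bytes_per_ns):
+        self._bw = {"read": read_bytes_per_ns, "write": write_bytes_per_ns}
+        self._free_at = {"read": 0.0, "write": 0.0}
+
+    def access(self, kind, nbytes, arrival_ns):
+        """Return the completion time of a read or write request."""
+        start = max(arrival_ns, self._free_at[kind])  # wait for same channel
+        done = start + nbytes / self._bw[kind]        # bytes over bandwidth
+        self._free_at[kind] = done
+        return done
+```
+
+Two reads that arrive together finish one after the other, while a write
+arriving at the same time proceeds on its own channel.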
+
+### sram
+
+<!-- src: ADR-0041 Context, Decision -->
+The cube SRAM is a per-cube scratchpad attached to one of the cube's
+routers. As a node it applies a configured access overhead, pays the
+link-bandwidth drain stamped on the incoming request, and sends a
+response on the reverse path. It is a terminal — it does not forward.
+
+<!-- src: ADR-0041 Decision, Consequences -->
+A second role is as one of three backing-memory tiers (TCM, SRAM, HBM)
+that an inter-PE collective slot can live in. When the slot lives in
+SRAM, the PE_DMA pays the slot read or write latency directly using
+the configured SRAM bandwidth and overhead; the SRAM component does
+not need to know about collective semantics. This separation keeps
+the SRAM component agnostic to the collective subsystem.
+
+### tiling
+
+<!-- src: ADR-0042 Context, Decision -->
+The tile-plan generator is not a runtime component — it is a pure
+module of functions that take a problem shape (matrix dimensions, tile
+sizes) and produce an ordered list of tile-stage sequences. The
+scheduler consumes this list. Each tile's stage sequence depends on
+how its operands are staged: operands streamed from HBM produce
+DMA_READ stages, operands already resident in TCM (because they were
+loaded eagerly upfront) skip them.
+
+<!-- src: ADR-0042 Decision, Consequences -->
+The plan generator is intentionally pure — given the same input it
+returns the same plan, with no simulator events created. This lets
+the rest of the system reason about tile sequences as data, and it
+makes the plan testable in isolation without simulator state. New
+plan variants (for example, K-major or DTensor-aware plans) can be
+added as new functions following the same shape.
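+
+A minimal sketch of such a pure plan function follows. The default
+tile sizes (32/64/32) and the stage names mirror the constants in the
+sweep script elsewhere in this change; the function name, its
+signature, and the exact per-tile stage order are assumptions made
+for illustration, not the scheduler's actual contract.
+
```python
def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def tile_plan(M, K, N, tile_m=32, tile_k=64, tile_n=32,
              a_resident=False, b_resident=False):
    """Return an ordered list of ((m, n, k), stages) entries.

    Pure function: same inputs give the same plan, and no simulator
    state is touched. The stage order here is illustrative only.
    """
    k_tiles = ceil_div(K, tile_k)
    plan = []
    for m in range(ceil_div(M, tile_m)):
        for n in range(ceil_div(N, tile_n)):
            for k in range(k_tiles):
                stages = []
                if not a_resident:
                    stages.append("DMA_READ")  # stream the A tile from HBM
                if not b_resident:
                    stages.append("DMA_READ")  # stream the B tile from HBM
                stages += ["FETCH", "GEMM"]
                if k == k_tiles - 1:
                    # Assumed: write the output tile after the last K step.
                    stages += ["STORE", "DMA_WRITE"]
                plan.append(((m, n, k), tuple(stages)))
    return plan
```
+
+With these defaults a 32x128x128 problem yields 8 tiles, and passing
+`a_resident=True, b_resident=True` removes every `DMA_READ` stage.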
+
+---
+
+## Implementation Decisions
+
+This section collects cross-cutting decisions — algorithms, policies,
+schemes, and contracts — that span multiple components rather than
+living inside one.
+
+### Address Scheme
+
+<!-- src: ADR-0001 Context, Decision -->
+Every physical address in the simulator decodes into a structured
+location. A fixed-width physical address carries the SIP id, the
+cube id within the SIP, a type discriminator (HBM vs PE-resource vs
+others), and a type-specific offset. HBM addresses additionally encode
+the per-PE slice offset so the controller can determine which PE
+owns the target slice without external lookup. The layout is
+deliberately reserved rather than packed-to-fit, so new sub-units can
+be added at the type-discriminator level without rewriting existing
+addresses.
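+
+The decode can be sketched as plain bit-field extraction. The field
+widths below are hypothetical (the actual layout is defined by
+ADR-0001 and is not reproduced here), and the per-PE slice encoding
+inside the HBM offset is left out.
+
```python
from dataclasses import dataclass

# Hypothetical field widths, chosen only for illustration.
SIP_BITS, CUBE_BITS, TYPE_BITS, OFFSET_BITS = 4, 4, 4, 36


@dataclass(frozen=True)
class Location:
    sip: int
    cube: int
    kind: int    # type discriminator (HBM, PE resource, ...)
    offset: int  # type-specific offset


def encode(loc: Location) -> int:
    pa = loc.sip
    pa = (pa << CUBE_BITS) | loc.cube
    pa = (pa << TYPE_BITS) | loc.kind
    pa = (pa << OFFSET_BITS) | loc.offset
    return pa


def decode(pa: int) -> Location:
    # Peel fields off from the least-significant end.
    offset = pa & ((1 << OFFSET_BITS) - 1)
    pa >>= OFFSET_BITS
    kind = pa & ((1 << TYPE_BITS) - 1)
    pa >>= TYPE_BITS
    cube = pa & ((1 << CUBE_BITS) - 1)
    pa >>= CUBE_BITS
    sip = pa & ((1 << SIP_BITS) - 1)
    return Location(sip, cube, kind, offset)
```
+
+Because each field has a fixed, reserved width, a new value of the
+type discriminator does not shift any existing address.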
+
+<!-- src: ADR-0011 Context, Decision -->
+On top of physical addressing, the simulator supports three address
+models that the runtime API selects between. Direct physical
+addressing is retained as a fallback. Virtual addressing — the
+current default — gives each tensor a contiguous virtual range at
+deployment, with the per-PE MMU translating per access; an
+alternative logical-address scheme remains a future option. The
+virtual-address path is the one every current test takes; the PA
+fallback is used by the MMU itself when no mapping exists for an
+address (a deliberate signal, not an error).
+
+<!-- src: ADR-0011 Decision, Consequences -->
+Tensor placement is represented as a list of physical-address shards,
+each tagged with target SIP, cube, and PE, plus a single tensor-wide
+virtual base. This means a kernel sees one virtual base for the whole
+tensor while the host driver and the engine still know exactly where
+each shard lives. Replicated tensors get per-cube local PA mappings;
+sharded tensors broadcast their mapping across cubes within a SIP.
+
+### Routing, Distance & Helper API
+
+<!-- src: ADR-0002 Context, Decision -->
+Routing is policy-driven, deterministic, and topology-aware. Given a
+source, a destination, and an intent — for example, PE-initiated
+DMA versus host-initiated memory write versus a generic
+component-to-component query — the routing layer picks the right
+path. The intent matters because different traffic types must avoid
+different categories of edges: PE-initiated DMA should not traverse
+command-only links; M_CPU DMA should not pass through PE-internal
+pipeline edges; cube-local transfers should not use the
+zero-distance UCIe bus that would otherwise look attractive to a
+shortest-path search.
+
+<!-- src: ADR-0051 Decision -->
+The routing layer therefore maintains four separate adjacency graphs
+at construction, each excluding a different category of edges, and
+picks the appropriate one per intent. On top of the graphs sits a
+helper API that hides the topology's naming convention: callers ask
+for the PCIe endpoint of a given SIP, the M_CPU of a given cube, or
+the HBM destination for a given physical address, and receive the
+corresponding node id. No component constructs node-id strings
+directly; if the naming convention ever changes, the change is local
+to the helper layer.
+
+<!-- src: ADR-0051 Decision, Consequences -->
+Path-finding itself uses Dijkstra with explicit per-edge weights
+(routing weight is allowed to differ from physical distance — for
+example, UCIe is configured to be routing-preferable). Tie-breaks
+follow insertion order, which keeps results deterministic. A path
+query between unreachable nodes raises an error rather than returning
+an empty path, so topology errors surface immediately.
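+
+A sketch of the path-finding behavior described above, assuming a
+plain adjacency dictionary and a push-order counter as the
+tie-breaker. The function name and graph representation are
+illustrative and are not the routing layer's real API.
+
```python
import heapq
from itertools import count


def shortest_path(adj, src, dst):
    """adj maps node -> list of (neighbor, routing_weight) in insertion order."""
    seq = count()  # push order breaks ties between equal-cost candidates
    heap = [(0.0, next(seq), src, [src])]
    done = set()
    while heap:
        cost, _, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in done:
            continue
        done.add(node)
        for nbr, weight in adj.get(node, []):
            if nbr not in done:
                heapq.heappush(
                    heap, (cost + weight, next(seq), nbr, path + [nbr])
                )
    # Unreachable destinations raise instead of returning an empty path.
    raise ValueError(f"no path from {src!r} to {dst!r}")
```
+
+With two equal-cost routes, the one whose first edge was inserted
+earlier wins, so repeated runs return the same path.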
+
+### Memory Semantics and Local-HBM Bandwidth
+
+<!-- src: ADR-0004 Context, Decision -->
+A PE accessing its own HBM slice through its own cube's NOC must see
+the full local HBM bandwidth — that is the model's intent. Memory
+traffic accumulates latency from per-component overhead and
+bytes-over-link-bandwidth serialization along the path, but the
+controller does not throttle below the slice's allotted bandwidth.
+Cross-PE-slice accesses inside the same cube, cross-cube accesses
+through UCIe, and cross-SIP accesses through PCIe each pay
+progressively more overhead as the path grows.
+
+### Topology Compilation, Diagrams & Builder Algorithms
+
+<!-- src: ADR-0006 Context, Decision -->
+Topology is configurable, not hardcoded. The simulator reads a YAML
+spec, compiles it into a flat graph of nodes and edges plus four
+view projections at different abstraction levels — system, SIP, cube,
+PE — and uses the compiled graph as the single source for both
+execution and visualization. Distance metadata used by routing is
+extracted at compile time so that diagrams and routing decisions
+agree by construction.
+
+<!-- src: ADR-0005 Context, Decision -->
+Diagrams are derived artifacts of the compiled topology. The visualizer
+produces one SVG per view at the appropriate abstraction level; nothing
+in the diagrams is hand-drawn or hand-positioned. Distance-aware
+layout rules place nodes in the diagrams using the same coordinates
+that routing uses to compute distance, so a diagram that "looks
+wrong" is a signal that the topology itself has a problem, not the
+visualizer.
+
+<!-- src: ADR-0053 Decision -->
+Inside a cube the router mesh is generated automatically. PE corner
+positions are fixed by convention; the relay-column algorithm
+inserts additional grid columns whenever the gap between adjacent PE
+columns would exceed a tunable maximum. HBM occupies a central
+exclusion zone — router slots inside the zone are deliberately empty,
+since HBM controllers attach as separate named nodes. M_CPU and SRAM
+attach to the nearest router by Euclidean distance from their
+configured placement coordinates, and UCIe physical lanes distribute
+along the boundary rows and columns. The whole mesh is cached
+beside the topology spec and invalidated only when one of a small set
+of layout-relevant fields changes.
+
+<!-- DIAGRAM: One cube's router mesh — rows × cols of routers with HBM exclusion zone in the middle, PEs/M_CPU/SRAM attaching to nearest routers, UCIe PHYs along the perimeter. -->
+
+### Tensor Deployment and Allocation
+
+<!-- src: ADR-0008 Context, Decision -->
+Tensor deployment in the runtime API produces a list of physical-address
+shards plus a single tensor-wide virtual base. The host allocator
+walks the data-parallelism policy, computes per-shard placement, and
+emits the per-shard physical addresses through the per-PE allocators.
+No separate "allocate then later attach to a device" RPC exists —
+allocation and deployment are a single operation that produces a
+deployed tensor handle.
+
+### Memory Allocator Algorithms
+
+<!-- src: ADR-0048 Context, Decision -->
+Each per-PE allocator owns two channels — HBM slice and TCM — each
+backed by an offset-keyed free-list. Allocation is first-fit; freeing
+coalesces with adjacent free blocks. A device-wide virtual allocator
+sits above the per-PE allocators, aligns requests up to the configured
+page size, and coalesces on free in the same way. The trade-off is
+explicit: first-fit is simpler and cheaper than best-fit or buddy
+allocation, and the simulator's workload is stack-like enough
+(deploy / kernel / free in matched order) that fragmentation is not
+a practical concern.
+
+<!-- src: ADR-0048 Decision, Consequences -->
+Allocation failure raises rather than silently returning a partial
+result. A partial tensor reaching the engine would route over wrong
+PAs and silently corrupt simulator output, so an out-of-memory signal
+is preferred. The free path trusts its caller to pass back exactly
+what was allocated; accepting a small risk of caller error in exchange
+for fast common-case freeing is documented as a deliberate trade-off.
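+
+The allocation policy can be sketched as a small offset-keyed
+free-list. The class and method names are hypothetical, and page
+alignment and the two-channel split are omitted.
+
```python
class FirstFitAllocator:
    """Offset-keyed free-list with first-fit allocation and coalescing."""

    def __init__(self, size: int):
        self.free = {0: size}  # offset -> length of each free block

    def alloc(self, nbytes: int) -> int:
        for off in sorted(self.free):  # lowest offset that fits wins
            length = self.free[off]
            if length >= nbytes:
                del self.free[off]
                if length > nbytes:
                    self.free[off + nbytes] = length - nbytes
                return off
        # Fail loudly; never hand back a partial allocation.
        raise MemoryError(f"out of memory: need {nbytes} bytes")

    def free_block(self, off: int, nbytes: int) -> None:
        # The caller is trusted to pass back exactly what was allocated.
        nxt = off + nbytes
        if nxt in self.free:  # merge with the following free block
            nbytes += self.free.pop(nxt)
        for prev_off, prev_len in self.free.items():
            if prev_off + prev_len == off:  # merge with the preceding block
                self.free[prev_off] = prev_len + nbytes
                return
        self.free[off] = nbytes
```
+
+Freeing every block, in any order, collapses the list back to a
+single free region covering the whole channel.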
+
+### Kernel Execution and Host-Device Messaging
+
+<!-- src: ADR-0009 Context, Decision -->
+Kernel execution decomposes into a small set of messages that travel
+the device graph. The host issues a single kernel-launch message; the
+IO_CPU fans it out per-cube; the cube M_CPU fans it out per-PE; the
+PE CPU resolves the kernel and runs it through the scheduler.
+Completion flows back the same way, gated by per-shard completion
+tracking. Memory operations follow the same pattern: a memory write
+or read travels as one message that the engine routes to the right
+HBM controller, with a response taking the reverse path.
+
+<!-- src: ADR-0012 Context, Decision -->
+The schema between the host and the device-side IO CPU is PA-first
+and shard-tagged. Every byte of host-issued payload arrives with an
+explicit target SIP, cube, PE, and physical address. The IO_CPU does
+not decode addresses to derive placement — placement is named
+explicitly by the shard list. This makes the host-device interface
+deterministic and keeps the routing helper free of host-derived
+intent.
+
+### CLI Surface and Semantics
+
+<!-- src: ADR-0010 Context, Decision -->
+The command-line interface exposes four subcommands. A bench runner
+loads a topology, resolves a registered benchmark by name or index,
+and runs it on a selected device. A bench-listing command enumerates
+the registered benchmarks. A probe utility runs a fixed catalog of
+traffic patterns through the engine for latency and bandwidth
+verification. A web viewer renders the topology in a browser. A
+benchmark instance is always single-device by convention; multi-SIP
+collective work happens inside the benchmark through the launcher
+abstraction, not by multiplexing the CLI.
+
+### Component Port and Wire Fabric Model
+
+<!-- src: ADR-0015 Context, Decision -->
+Every modeled component exposes input and output ports, and every
+edge in the topology connects an output port on one component to an
+input port on another. Bandwidth and propagation delay are properties
+of the wire between ports, not of the component endpoints. A
+component's responsibility is to apply its configured per-node
+overhead and either forward to the next hop or terminate; the wire
+charges the bytes-over-bandwidth serialization separately.
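+
+As a rough worked example of that split, the sketch below sums a
+per-node overhead and a per-wire charge along a path. It assumes
+serialization is paid on every wire in sequence; the units, the
+function name, and the tuple layout are illustrative only.
+
```python
def path_latency_ns(nbytes, hops):
    """hops: list of (node_overhead_ns, wire_bw_bytes_per_ns, wire_delay_ns)."""
    total = 0.0
    for node_overhead, wire_bw, wire_delay in hops:
        total += node_overhead               # charged by the component
        total += wire_delay + nbytes / wire_bw  # charged by the wire
    return total
```
+
+For a 1024-byte request over two hops, `(10, 64, 1)` and `(5, 32, 2)`,
+this gives 10 + 1 + 16 + 5 + 2 + 32 = 66 ns.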
+
+<!-- src: ADR-0015 Decision, Consequences -->
+This separation lets components be swapped behind their port
+interface without changing the rest of the model, and it keeps
+bandwidth contention at the wire level where multiple components may
+contend for the same edge. Future component models can refine
+internal behavior without disturbing the fabric.
+
+### Two-Pass Data Execution
+
+<!-- src: ADR-0020 Context, Decision -->
+The simulator runs in two passes. The first pass — fast and always
+on — runs the discrete-event engine and records every data operation
+in an operation log with timestamps, component identifiers, and
+per-operation parameters. The second pass, which is opt-in, replays
+the log against an in-memory tensor store to produce actual numerical
+results. Tests that only need timing skip the second pass; tests that
+need to verify correctness opt in.
+
+<!-- src: ADR-0020 Decision, Consequences -->
+The split lets the timing engine remain unconcerned with data
+semantics: kernels move handles around, not bytes. The replay phase
+recovers data semantics from the recorded operations, in their
+original time order with a small set of secondary-sort rules. The
+op-log records carry enough metadata — input snapshots for compute
+operations, source snapshots for cross-component copies — that the
+replay phase cannot mis-order with respect to in-flight mutations.
+
+### Sim-engine Op Log and Memory Store Schemas
+
+<!-- src: ADR-0052 Context, Decision -->
+The operation log holds typed records with seven fields each: start
+and end timestamps, the component that issued the operation, an
+operation kind ("memory", "gemm", "math"), an operation name, a
+parameter dictionary, and a (currently unused) dependency list.
+Records are kept in stable timestamp order. The parameter dictionary
+varies by operation: a DMA read carries source address and byte count;
+a GEMM carries operand shapes, dtypes, and address spaces; a math
+operation carries input addresses and snapshots.
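+
+One way to picture the record is as a seven-field dataclass. The
+names `t_start`, `t_end`, `component_id`, and `params` match how the
+sweep script in this change reads records; `op_kind`, `op_name`, and
+`deps` are assumed names for the remaining fields.
+
```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str   # "memory", "gemm", or "math"
    op_name: str
    params: dict[str, Any] = field(default_factory=dict)
    deps: list[int] = field(default_factory=list)  # currently unused


def in_time_order(records):
    # sorted() is stable, so records with equal start times keep
    # the order in which they were logged.
    return sorted(records, key=lambda r: r.t_start)
```
+
+A DMA read would then carry its source address and byte count in
+`params`, while a GEMM record carries operand shapes and dtypes.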
+
+<!-- src: ADR-0052 Decision, Consequences -->
+The companion memory store is a two-level dictionary keyed by
+address space ("hbm", "tcm", "sram", others) and integer address.
+Reads and writes are reference-based — no copy by default — so
+callers wanting to detach a snapshot must copy explicitly. This is
+deliberate: the engine-internal snapshot paths copy at well-defined
+points (math input capture, HBM source capture for DMA writes,
+inbound collective copies) and downstream replay code therefore
+sees stable data even when slot or scratch addresses are reused by
+later operations.
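+
+The reference semantics are easy to demonstrate with a toy two-level
+store. The class and method names are hypothetical and a plain list
+stands in for tensor data.
+
```python
import copy


class MemoryStore:
    """Two-level dictionary: address space -> integer address -> value."""

    def __init__(self):
        self._spaces = {}

    def write(self, space, addr, value):
        self._spaces.setdefault(space, {})[addr] = value  # stores a reference

    def read(self, space, addr):
        return self._spaces[space][addr]  # returns the same object, no copy


store = MemoryStore()
buf = [1, 2, 3]
store.write("tcm", 0x100, buf)

alias = store.read("tcm", 0x100)                      # shares storage with buf
snapshot = copy.deepcopy(store.read("tcm", 0x100))    # detached copy

buf[0] = 99  # a later operation reuses the slot
```
+
+After the mutation `alias[0]` is 99 while `snapshot[0]` is still 1,
+which is why the engine copies at its snapshot points.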
+
+### 2D Grid Program Identity
+
+<!-- src: ADR-0022 Context, Decision -->
+Inside a kernel the program identity is two-dimensional. The
+first axis corresponds to the PE index within a cube; the second
+corresponds to the cube index within a SIP. Together they let a
+kernel address its position both within its cube and within the
+larger system without needing to know the full topology. Total
+program counts along each axis are exposed symmetrically.
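+
+A small sketch of how a kernel might combine the two axes. `FakeTL`
+is a stand-in written for this example, and `num_programs` is an
+assumed name for the per-axis count; only `tl.program_id(axis=...)`
+is taken from the kernels in this change.
+
```python
class FakeTL:
    """Stand-in for the kernel-side context; not the real runtime class."""

    def __init__(self, pe_id, cube_id, pes_per_cube, num_cubes):
        self._ids = (pe_id, cube_id)
        self._counts = (pes_per_cube, num_cubes)

    def program_id(self, axis: int) -> int:
        return self._ids[axis]

    def num_programs(self, axis: int) -> int:
        return self._counts[axis]


def global_index(tl) -> int:
    # axis 0 = PE within the cube, axis 1 = cube within the SIP
    pe = tl.program_id(axis=0)
    cube = tl.program_id(axis=1)
    return cube * tl.num_programs(axis=0) + pe
```
+
+PE 3 of cube 2, with 8 PEs per cube, gets global index 19 without the
+kernel knowing anything else about the topology.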
+
+### Parallelism — SIP Launcher, DPPolicy, Megatron-TP, AHBM Backend, and CCL Algorithm Module
+
+<!-- src: ADR-0024 Context, Decision -->
+The launcher model treats each SIP as one rank. Inside a process the
+launcher spawns one greenlet per SIP rank; the rank is bound to its
+greenlet so that any code running in that worker sees the right
+distributed-style rank. This is a deliberately PyTorch-compatible
+shape: a benchmark looks like a small DDP training script — initialize
+a process group, spawn workers, each worker runs the same body.
+
+<!-- src: ADR-0026 Context, Decision -->
+Data-parallelism policy lives in a single object that names the
+sharding strategy along the cube axis (replicate, row-wise,
+column-wise) and along the PE axis (same set of values), and optionally
+overrides the number of cubes or PEs participating. The policy is
+intra-device — it does not cross SIP boundaries. SIP-level parallelism
+is the launcher's responsibility, and the two axes compose
+orthogonally.
+
+<!-- src: ADR-0027 Context, Decision -->
+A Megatron-style tensor-parallel API sits on top of the launcher and
+the DP policy. Layer-level building blocks — column-parallel linear,
+row-parallel linear, all-reduce — name their sharding intent in terms
+the launcher and the placement policy can compose. This is the layer
+that bench code typically writes against.
+
+<!-- src: ADR-0047 Context, Decision -->
+For collective operations the runtime exposes a PyTorch-compatible
+distributed backend named "ahbm". On process-group initialization the
+backend loads the configured collective-algorithm module, resolves
+the world size (priority: explicit ccl.yaml override → defaults
+section → topology SIP count), imports the algorithm module
+dynamically, derives the SIP topology kind, and pushes the inter-PE
+neighbor table to every participating PE. From that point on, an
+all-reduce call dispatches the algorithm's kernel function across
+all ranks.
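+
+The world-size priority chain amounts to a first-non-null lookup. In
+this sketch `None` means "not set"; the function and parameter names
+are illustrative rather than the backend's own.
+
```python
def resolve_world_size(ccl_override, ccl_default, topology_sip_count):
    """Priority: explicit override, then defaults section, then topology."""
    for candidate in (ccl_override, ccl_default):
        if candidate is not None:
            return candidate
    return topology_sip_count
```
+
+With no override and no default, a six-SIP topology resolves to a
+world size of 6.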
+
+<!-- src: ADR-0050 Context, Decision -->
+A collective-algorithm module is a Python module with a small, fixed
+contract. It exposes topology-kind integer constants, a name-to-kind
+mapping for the YAML configuration, a kernel-arguments builder, and
+a kernel function — the kernel function being aliased to the name
+`kernel` so the backend can find it generically. The kernel itself
+takes the tensor pointer, the per-cube element count, cube mesh
+width and height, the world size, the current rank, and the SIP
+topology dimensions; the backend appends those last four arguments
+automatically. New collectives slot in by adding a new module that
+follows this shape.
+
+<!-- src: ADR-0027 Decision, Consequences -->
+The combination is deliberate: bench authors get to write code that
+looks like a regular distributed training script, while the launcher,
+backend, and placement policies behind it remain free to redirect
+work to the right SIP, cube, and PE without exposing topology to the
+kernel.
+
+### IPCQ Direction Addressing
+
+<!-- src: ADR-0025 Context, Decision -->
+Inside a collective algorithm, peer PEs are named by direction —
+"N", "S", "E", "W" for cube-internal neighbors, and "global_*" for
+cross-SIP neighbors. Direction addressing is the addressing scheme:
+the algorithm names a direction, the IPCQ neighbor table installed
+at process-group time resolves the direction to the peer endpoint's
+physical-address coordinates, and the PE_DMA performs the actual
+transfer. The algorithm itself does not see PA arithmetic — direction
+is the user-facing handle.
+
+### Intercube All-Reduce
+
+<!-- src: ADR-0032 Context, Decision -->
+The default all-reduce algorithm uses a center-rooted bidirectional
+phase inside each SIP's cube mesh followed by an inter-SIP exchange
+on the mesh's root cube, and then a bidirectional broadcast back
+out. Center-rooting halves the in-cube hop count compared with a
+corner-rooted walk. The inter-SIP exchange itself follows the
+configured SIP topology — ring, torus, or non-wrapping mesh —
+selected at runtime through the SIP-topology kind integer the
+backend passes to the kernel.
+
+### Evaluation Harnesses
+
+<!-- src: ADR-0043 Context, Decision -->
+The all-reduce evaluation harness drives correctness and the
+latency/buffer-kind sweeps through the public distributed path —
+initialize process group, spawn workers, call all-reduce — rather
+than the lower-level engine interface. A shared helper module factors
+out the setup; sweep tests cover the buffer-kind tiers (TCM, SRAM,
+HBM) and the inter-SIP topology variants. The plots produced by the
+harness are part of its output contract; the harness regenerates them
+on demand.
+
+<!-- src: ADR-0044 Context, Decision -->
+The GEMM evaluation harness is split into two layers. A heavy
+shape-and-variant sweep lives as a manual script — it runs the same
+composite-GEMM benchmark across many shapes and operand-staging
+variants, harvests the resulting op-log, and writes a JSON summary.
+A faster figure-generation layer lives in the test suite and consumes
+that JSON to render plots. The split keeps the heavy data
+generation explicit and out of the regular test path.
+
+### Bench Module Contract
+
+<!-- src: ADR-0045 Context, Decision -->
+Adding a new benchmark requires only dropping a file into the
+benchmarks directory. The file registers one or more benchmark
+functions through a small decorator that takes a kebab-case name and
+a human-readable description. The decorator is the registration
+mechanism — there is no separate manifest. Each benchmark function
+takes one argument, conventionally named `torch`, which is the
+runtime context exposing tensor allocation, kernel launch,
+distributed APIs, and process-spawning. The function name is `run` by
+convention.
+
+<!-- src: ADR-0045 Decision, Consequences -->
+A benchmark must submit at least one operation, or the runner
+returns an error. A benchmark instance is single-device by default;
+when a benchmark is collective, it uses the distributed-process-spawn
+pattern internally — one worker greenlet per rank, with each worker
+binding to its rank. Multi-device benchmark patterns outside that
+shape are not supported.
+
+### Kernel-side `tl.*` API
+
+<!-- src: ADR-0046 Context, Decision -->
+Inside a kernel function, the `tl` argument exposes the kernel-side
+API in a shape that mirrors the conventions of established
+GPU-kernel languages. Categories: reference handles that name HBM
+data without issuing DMA; data movement (load, store) that does
+issue DMA; GEMM and math compute (dot, composite, the unary and
+binary math operations, reductions); index and scalar helpers
+(program identity, range-builders); metadata-only operations like
+transpose; and the collective primitives (send, receive,
+non-blocking receive). Tensor handles support arithmetic operators
+via a thread-local active context so kernel code reads naturally.
+
+<!-- src: ADR-0046 Decision, Consequences -->
+The API supports two execution modes. A command-list mode records
+operations into a list without consuming simulator time — useful for
+inspection and lightweight tests. A greenlet-driven mode runs the
+kernel as a child greenlet that switches back to the simulator on
+each `tl.*` call; the simulator drives the event scheduler and hands
+real data back to the kernel as DMA reads complete. The two modes
+share the same surface; the kernel does not know which one it is
+running under.
+
+### Probe Subcommand
+
+<!-- src: ADR-0049 Context, Decision -->
+The probe utility runs three families of traffic patterns through
+the engine — host-to-device writes at increasing hop counts,
+device-to-host reads at increasing hop counts, and PE-initiated DMA
+across the cube mesh — and reports actual latency, the analytical
+formula breakdown, effective bandwidth, bottleneck bandwidth, and
+utilization. A fixed reference size is used for the summary table;
+a separate utilization-versus-size sweep covers a logarithmic range
+of transfer sizes. Each case runs in its own engine instance so
+cases do not perturb each other.
+
+<!-- src: ADR-0049 Decision, Consequences -->
+The probe also checks a small set of invariants automatically:
+monotonic latency increase with hop count, device-to-host latency
+at least as large as host-to-device for the same hop count, and a
+faster best-case path than worst-case for cross-cube PE DMA. Failures
+print prominently. The output is meant for human reading; automated
+parsing should not depend on column widths or whitespace.
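+
+The first two invariants can be sketched as checks over latency lists
+indexed by hop count. The function name and input shape are
+assumptions, "monotonic" is read here as non-decreasing, and the
+cross-cube best-case versus worst-case check is left out.
+
```python
def check_probe_invariants(h2d_ns, d2h_ns):
    """Return a list of human-readable failures; empty means all passed."""
    failures = []
    # Latency must not drop as the hop count grows.
    for name, series in (("h2d", h2d_ns), ("d2h", d2h_ns)):
        for hops, (prev, cur) in enumerate(zip(series, series[1:]), start=1):
            if cur < prev:
                failures.append(f"{name}: latency dropped at hop {hops}")
    # Device-to-host must be at least as slow as host-to-device.
    for hops, (write, read) in enumerate(zip(h2d_ns, d2h_ns)):
        if read < write:
            failures.append(f"d2h faster than h2d at hop {hops}")
    return failures
```
+
+Returning messages rather than raising lets the caller print every
+failure prominently in one pass.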
+
+---
+
+This document summarizes 46 architecture decisions captured during
+the first half of 2026. It is regenerated mechanically from the
+decision corpus; sources are recorded in HTML comments throughout.
@@ -1,237 +1,20 @@
 """Sweep GEMM shapes through kernbench and dump PE_accelerator engine times.

-For each shape:
-  - run benches.matmul_composite via the same run_bench path the CLI uses
-  - read result.engine.op_log
-  - filter to per-PE engines: pe_dma, pe_fetch_store, pe_gemm, pe_math
-  - record sum-of-durations (engine occupancy) AND wall-clock active interval
+Thin wrapper: the sweep logic now lives in
+``kernbench.benches.milestone_1h_gemm`` (the single home, ADR-0054, also the
+``milestone-1h-gemm`` bench). This script remains the manual entry point for
+regenerating ``docs/diagrams/gemm_sweep.json`` on demand and honors the same
+``SWEEP_SHAPES`` / ``SWEEP_TOPOLOGY`` env overrides.

-Output: docs/diagrams/gemm_sweep.json
+    python scripts/gemm_sweep.py
 """
 from __future__ import annotations

-import json
-import os
-import sys
-import time
-from pathlib import Path
-
-# Default sweep covering under-tile, single-tile, multi-tile, and asymmetric regimes.
-# Each entry is either a single integer (square M=K=N=S) or "MxKxN".
-# Override via env: SWEEP_SHAPES="16,32,16x2048x16,..."
-DEFAULT_SHAPES = [
-    "32x32x32",       # 1 tile, K=32 < TILE_K=64 → under-tile in K
-    "32x64x32",       # 1 tile, exact single-tile fit
-    "32x128x32",      # 2 tiles, aligned
-    "32x128x128",     # 8 tiles, aligned
-    "32x3072x32",     # 48 tiles, all K-axis (tall-skinny)
-    "8x128x128",      # 8 tiles, but M=8 < TILE_M=32 → MAC array under-fed
-    "128x8x128",      # 16 tiles, but K=8 < TILE_K=64 → MAC array under-fed
-    "512",            # 2048 tiles, fully aligned — "well-pipelined" reference
-]
-
-# Operand-staging variants exercised per shape.
-VARIANTS = ["ref_ref", "load_ref", "load_load"]
-
-# Engines whose timings we collect (component_id suffix match).
-ENGINES = ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_math"]
-
-# Per-stage breakdown labels (StageType enum names from pe_types.py).
-STAGES = ["DMA_READ", "DMA_WRITE", "FETCH", "STORE", "GEMM", "MATH"]
-
-# Scheduler tile sizes (mirror of PeSchedulerComponent.TILE_M/K/N).
-TILE_M, TILE_K, TILE_N = 32, 64, 32
-
-OUT_PATH = Path(__file__).parent.parent / "docs" / "diagrams" / "gemm_sweep.json"
-
-
-def _engine_wall_ns(records, suffix: str) -> float:
-    """Wall-clock interval the engine was active (union of overlapping ops)."""
-    intervals = [(r.t_start, r.t_end) for r in records
-                 if r.component_id.endswith("." + suffix)]
-    if not intervals:
-        return 0.0
-    intervals.sort()
-    merged_end = intervals[0][1]
-    merged_start = intervals[0][0]
-    total = 0.0
-    for s, e in intervals[1:]:
-        if s <= merged_end:
-            merged_end = max(merged_end, e)
-        else:
-            total += merged_end - merged_start
-            merged_start, merged_end = s, e
-    total += merged_end - merged_start
-    return total
-
-
-def _engine_occupancy_ns(records, suffix: str) -> float:
-    return sum(r.t_end - r.t_start for r in records
-               if r.component_id.endswith("." + suffix))
-
-
-def _engine_count(records, suffix: str) -> int:
-    return sum(1 for r in records if r.component_id.endswith("." + suffix))
-
-
-def _stage_occupancy_ns(records, stage_type: str) -> float:
-    """Sum t_end - t_start over op_log records whose params.stage_type matches.
-
-    Requires op_log records produced post the TileToken stage_type capture
-    (sim_engine/op_log.py).
-    """
-    return sum(
-        r.t_end - r.t_start
-        for r in records
-        if r.params.get("stage_type") == stage_type
-    )
-
-
-def _stage_wall_ns(records, stage_type: str) -> float:
-    """Interval-union wall-clock for records whose stage_type matches."""
-    intervals = sorted(
-        (r.t_start, r.t_end) for r in records
-        if r.params.get("stage_type") == stage_type
-    )
-    if not intervals:
-        return 0.0
-    total = 0.0
-    cs, ce = intervals[0]
-    for s, e in intervals[1:]:
-        if s <= ce:
-            ce = max(ce, e)
-        else:
-            total += ce - cs
-            cs, ce = s, e
-    total += ce - cs
-    return total
-
-
-def _stage_count(records, stage_type: str) -> int:
-    return sum(1 for r in records if r.params.get("stage_type") == stage_type)
-
-
-def _run_one(M: int, K: int, N: int, topology: str, variant: str = "ref_ref") -> dict:
-    os.environ["MATMUL_M"] = str(M)
-    os.environ["MATMUL_K"] = str(K)
-    os.environ["MATMUL_N"] = str(N)
-    os.environ["MATMUL_VARIANT"] = variant
-
-    # Late imports so env vars are read by matmul_composite at module load.
-    # Force re-import to pick up new env values.
-    for mod_name in [m for m in list(sys.modules) if m.startswith("kernbench.benches.matmul_composite")]:
-        del sys.modules[mod_name]
-
-    from kernbench.benches.registry import resolve as resolve_bench
-    from kernbench.runtime_api.bench_runner import run_bench
-    from kernbench.runtime_api.types import resolve_device
-    from kernbench.sim_engine.engine import GraphEngine
-    from kernbench.topology.builder import resolve_topology
-
-    topo = resolve_topology(topology)
-    bench = resolve_bench("matmul-composite").run
-    device = resolve_device(None)
-
-    t0 = time.time()
-    result = run_bench(
-        topology=topo, bench_fn=bench, device=device,
-        engine_factory=lambda t, d: GraphEngine(
-            getattr(t, "topology_obj", t), enable_data=True,
-        ),
-    )
-    wall = time.time() - t0
-
-    op_log = result.engine.op_log
-    if not result.completion.ok:
-        raise RuntimeError(f"bench failed at M={M},K={K},N={N}: {result.completion}")
-
-    # Bytes touched at f16 (2 B): full A + full B + full out (each operand
-    # streamed once through HBM by the composite plan).
-    bytes_total = (M * K + K * N + M * N) * 2
-    row = {
-        "M": M, "K": K, "N": N,
-        "variant": variant,
-        "flops": 2 * M * K * N,
-        "bytes_hbm": bytes_total,
-        "arith_intensity": (2 * M * K * N) / bytes_total,  # flops/byte
-        "tile_count_expected": _ceil(M, TILE_M) * _ceil(N, TILE_N) * _ceil(K, TILE_K),
-        "sim_wall_clock_s": round(wall, 3),
-        "engines": {},
-    }
-    for eng in ENGINES:
-        row["engines"][eng] = {
-            "occupancy_ns": _engine_occupancy_ns(op_log, eng),
-            "wall_ns":      _engine_wall_ns(op_log, eng),
-            "record_count": _engine_count(op_log, eng),
-        }
-    row["stages"] = {}
-    for stage in STAGES:
-        row["stages"][stage] = {
-            "occupancy_ns": _stage_occupancy_ns(op_log, stage),
-            "wall_ns":      _stage_wall_ns(op_log, stage),
-            "record_count": _stage_count(op_log, stage),
-        }
-    # Kernel-window wall-clock = max t_end - min t_start over PE engine records.
-    pe_records = [r for r in op_log
-                  if any(r.component_id.endswith("." + e) for e in ENGINES)]
-    if pe_records:
-        row["pe_window_ns"] = max(r.t_end for r in pe_records) \
-                              - min(r.t_start for r in pe_records)
-    else:
-        row["pe_window_ns"] = 0.0
-    stage_records = [r for r in op_log
-                     if r.params.get("stage_type") in STAGES]
-    if stage_records:
-        row["composite_window_ns"] = max(r.t_end for r in stage_records) \
-                                     - min(r.t_start for r in stage_records)
-    else:
-        row["composite_window_ns"] = 0.0
-    return row
-
-
-def _ceil(a: int, b: int) -> int:
-    return (a + b - 1) // b
+from kernbench.benches.milestone_1h_gemm import run_sweep


 def main() -> int:
-    shapes_env = os.environ.get("SWEEP_SHAPES")
-    raw = (shapes_env.split(",") if shapes_env else DEFAULT_SHAPES)
-    shapes: list[tuple[int, int, int]] = []
-    for s in raw:
-        s = s.strip()
-        if not s:
-            continue
-        if "x" in s.lower():
-            parts = s.lower().split("x")
-            shapes.append((int(parts[0]), int(parts[1]), int(parts[2])))
-        else:
-            v = int(s)
-            shapes.append((v, v, v))
-    topology = os.environ.get("SWEEP_TOPOLOGY", "topology.yaml")
-
-    rows = []
-    for M, K, N in shapes:
-        for variant in VARIANTS:
-            print(f"[sweep] M={M} K={K} N={N} variant={variant} ...", flush=True)
-            row = _run_one(M, K, N, topology, variant=variant)
-            rows.append(row)
-            eng_dma = row["engines"]["pe_dma"]
-            eng_gem = row["engines"]["pe_gemm"]
-            print(f"   tiles={row['tile_count_expected']:>6}  "
-                  f"pe_window={row['pe_window_ns']:8.1f}ns  "
-                  f"dma_occ={eng_dma['occupancy_ns']:9.1f}  "
-                  f"gemm_occ={eng_gem['occupancy_ns']:8.1f}  "
-                  f"(sim {row['sim_wall_clock_s']:.1f}s)")
-
-    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
-    OUT_PATH.write_text(json.dumps({
-        "tile_sizes": {"M": TILE_M, "K": TILE_K, "N": TILE_N},
-        "engines": ENGINES,
-        "stages": STAGES,
-        "variants": VARIANTS,
-        "rows": rows,
-    }, indent=2))
-    print(f"\n[sweep] wrote {OUT_PATH}")
+    run_sweep()
    return 0


@@ -0,0 +1,13 @@
+buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
+hbm,torus_2d,6,128,256,2120.040000000012
+hbm,torus_2d,6,1024,2048,2717.2783333333473
+hbm,torus_2d,6,8192,16384,7315.184999999989
+hbm,torus_2d,6,32768,65536,23081.26500000037
+sram,torus_2d,6,128,256,2060.040000000012
+sram,torus_2d,6,1024,2048,2909.2783333333473
+sram,torus_2d,6,8192,16384,9523.184999999869
+sram,torus_2d,6,32768,65536,32201.265000000385
+tcm,torus_2d,6,128,256,1964.040000000012
+tcm,torus_2d,6,1024,2048,2477.2783333333473
+tcm,torus_2d,6,8192,16384,6403.185000000109
+tcm,torus_2d,6,32768,65536,19865.265000000378
@@ -0,0 +1,37 @@
+algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,8,16,256,2666.552500000015
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,32,64,1024,2747.7400000000152
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,64,128,2048,2855.990000000018
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,128,256,4096,3072.490000000019
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3337.1133333333582
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3708.0333333333692
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4449.873333333393
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,5933.020000000124
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,8900.379999999863
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,14835.099999999224
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,26704.540000000765
+lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,38573.97999999701
+lrab_hierarchical_allreduce,ring_1d,6,8,16,256,2365.255833333347
+lrab_hierarchical_allreduce,ring_1d,6,32,64,1024,2436.9433333333473
+lrab_hierarchical_allreduce,ring_1d,6,64,128,2048,2532.526666666683
+lrab_hierarchical_allreduce,ring_1d,6,128,256,4096,2723.693333333349
+lrab_hierarchical_allreduce,ring_1d,6,512,1024,16384,3048.635000000021
+lrab_hierarchical_allreduce,ring_1d,6,1024,2048,32768,3393.4016666666957
+lrab_hierarchical_allreduce,ring_1d,6,2048,4096,65536,4082.401666666714
+lrab_hierarchical_allreduce,ring_1d,6,4096,8192,131072,5458.80166666677
+lrab_hierarchical_allreduce,ring_1d,6,8192,16384,262144,8216.934999999943
+lrab_hierarchical_allreduce,ring_1d,6,16384,32768,524288,13733.201666665835
+lrab_hierarchical_allreduce,ring_1d,6,32768,65536,1048576,24765.73500000064
+lrab_hierarchical_allreduce,ring_1d,6,49152,98304,1572864,35798.268333331536
+lrab_hierarchical_allreduce,torus_2d,6,8,16,256,1700.6025000000095
+lrab_hierarchical_allreduce,torus_2d,6,32,64,1024,1753.2900000000102
+lrab_hierarchical_allreduce,torus_2d,6,64,128,2048,1823.540000000012
+lrab_hierarchical_allreduce,torus_2d,6,128,256,4096,1964.040000000012
+lrab_hierarchical_allreduce,torus_2d,6,512,1024,16384,2196.8183333333463
+lrab_hierarchical_allreduce,torus_2d,6,1024,2048,32768,2477.2783333333473
+lrab_hierarchical_allreduce,torus_2d,6,2048,4096,65536,3038.1983333333583
+lrab_hierarchical_allreduce,torus_2d,6,4096,8192,131072,4159.5050000000665
+lrab_hierarchical_allreduce,torus_2d,6,8192,16384,262144,6403.185000000109
+lrab_hierarchical_allreduce,torus_2d,6,16384,32768,524288,10890.5449999995
+lrab_hierarchical_allreduce,torus_2d,6,32768,65536,1048576,19865.265000000378
+lrab_hierarchical_allreduce,torus_2d,6,49152,98304,1572864,28839.98500000059
@@ -0,0 +1,180 @@
+"""Mesh-native bidirectional Ring-K/V attention kernel — prefill (ADR-0059 Proposed).
+
+Each rank holds its own Q tile and 1/n_ranks of K, V (sequence-sharded).
+Over ``n_ranks - 1`` bidirectional steps, K and V propagate both east and
+west: chunk c_i originating at rank i reaches rank j at step ``|i - j|``.
+Every rank receives every other rank's chunk **exactly once** and folds it
+into a running ``(m, ℓ, o)`` via the online-softmax recurrence. After all
+steps each rank holds the final attention output for its own Q tokens —
+no cross-rank merge is required.
+
+Supersedes ADR-0055's closed-ring ``_attention_ring_kv.py``. Both modules
+stay on disk during the transition; this one runs on the hardware's
+actual open-mesh wiring (no closed-ring SFR install required).
+
+Imported by ``milestone_gqa_llama70b`` (after the bench's Phase 2 switches
+its imports) and invoked through ``torch.launch(...)`` — not through
+``dist.all_reduce(...)``. See ADR-0055 Context for why this kernel is not
+backend-dispatched via ADR-0050's algorithm-module contract.
+"""
+from __future__ import annotations
+
+from kernbench.common.pe_commands import TensorHandle
+
+
+def _view(handle: TensorHandle, new_shape: tuple[int, ...]) -> TensorHandle:
+    """Reshape — metadata only, no command emitted (cf. ``tl.trans``)."""
+    return TensorHandle(
+        id=handle.id,
+        addr=handle.addr,
+        shape=new_shape,
+        dtype=handle.dtype,
+        nbytes=handle.nbytes,
+        data=handle.data,
+        space=handle.space,
+        pinned=handle.pinned,
+    )
+
+
+def _partial_attention(
+    Q: TensorHandle,
+    K: TensorHandle,
+    V: TensorHandle,
+    S_q: int,
+    S_kv_per_rank: int,
+    h_q: int,
+    d_head: int,
+    tl,
+) -> tuple[TensorHandle, TensorHandle, TensorHandle]:
+    """One pass of partial attention against (K, V).
+
+    Emits 1 GEMM(Q·K^T) + softmax + max + sub + exp + sum + 1 GEMM(P·V).
+    Returns the running-statistics triplet ``(m, ℓ, O_partial)`` for the
+    online-softmax mlo merge.
+    """
+    K_2d_T = _view(K, (h_q * d_head, S_kv_per_rank))
+    V_2d = _view(V, (S_kv_per_rank, h_q * d_head))
+
+    scores = tl.dot(Q, K_2d_T)
+    m = tl.max(scores, axis=-1)
+    P = tl.softmax(scores, axis=-1)
+    scores_centered = scores - m
+    exp_scores = tl.exp(scores_centered)
+    ell = tl.sum(exp_scores, axis=-1)
+    O_partial = tl.dot(P, V_2d)
+    return m, ell, O_partial
+
+
+def attention_mesh_kv_kernel(
+    q_ptr: int,
+    k_ptr: int,
+    v_ptr: int,
+    o_ptr: int,
+    S_q: int,
+    S_kv_per_rank: int,
+    h_q: int,
+    h_kv: int,
+    d_head: int,
+    n_ranks: int,
+    rank_axis: int = 0,
+    *,
+    tl,
+) -> None:
+    """Mesh-native bidirectional Ring-K/V attention — see module docstring.
+
+    ``rank_axis`` selects which program-id dimension carries the ring rank:
+        0 — single_user_* panels: rank == tl.program_id(axis=0) (PE id in cube).
+        1 — multi_user_* panels: ring is at the cube level. Only PE 0 in each
+            cube participates; the other 7 hold KV replicas but stay silent.
+    """
+    # For multi_user (rank_axis=1) only PE 0 in each cube runs the ring.
+    if rank_axis != 0 and tl.program_id(axis=0) != 0:
+        return
+    rank = tl.program_id(axis=rank_axis)
+    has_E = rank < n_ranks - 1
+    has_W = rank > 0
+
+    # Q stays put on this rank — loaded once, used in every partial attention.
+    Q = tl.load(q_ptr, shape=(S_q, h_q * d_head), dtype="f16")
+
+    # Local K, V chunk.
+    K = tl.load(k_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")
+    V = tl.load(v_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")
+
+    # Step 0 (local): partial attention against own K, V — initializes the
+    # running triplet (m, ℓ, o).
+    m, ell, o = _partial_attention(
+        Q, K, V, S_q, S_kv_per_rank, h_q, d_head, tl,
+    )
+
+    # Seed bidirectional waves with own chunk (step-1 send).
+    to_send_east_K: TensorHandle | None = K
+    to_send_east_V: TensorHandle | None = V
+    to_send_west_K: TensorHandle | None = K
+    to_send_west_V: TensorHandle | None = V
+
+    # Bidirectional fan-out: n_ranks - 1 steps. By step k, the wave from
+    # rank i has reached rank (i ± k). After n_ranks - 1 steps, every rank
+    # has merged every other rank's chunk exactly once (ADR-0059 D3).
+    for step in range(1, n_ranks):
+        # Send the eastbound wave we currently hold (own at step 1; forwarded
+        # at later steps). ``None`` means we have no wave to forward this
+        # direction this step (edge rank, or the wave already passed by).
+        if has_E and to_send_east_K is not None:
+            tl.send(dir="E", src=to_send_east_K)
+            tl.send(dir="E", src=to_send_east_V)
+        if has_W and to_send_west_K is not None:
+            tl.send(dir="W", src=to_send_west_K)
+            tl.send(dir="W", src=to_send_west_V)
+
+        # Receive eastbound wave from W (carries chunk c_{rank - step}).
+        K_from_W: TensorHandle | None = None
+        V_from_W: TensorHandle | None = None
+        if has_W and (rank - step) >= 0:
+            K_from_W = tl.recv(
+                dir="W", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
+            )
+            V_from_W = tl.recv(
+                dir="W", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
+            )
+            m_new, ell_new, o_new = _partial_attention(
+                Q, K_from_W, V_from_W, S_q, S_kv_per_rank, h_q, d_head, tl,
+            )
+            m_combined = tl.maximum(m, m_new)
+            scale_old = tl.exp(m - m_combined)
+            scale_new = tl.exp(m_new - m_combined)
+            ell = ell * scale_old + ell_new * scale_new
+            o = o * scale_old + o_new * scale_new
+            m = m_combined
+
+        # Receive westbound wave from E (carries chunk c_{rank + step}).
+        K_from_E: TensorHandle | None = None
+        V_from_E: TensorHandle | None = None
+        if has_E and (rank + step) < n_ranks:
+            K_from_E = tl.recv(
+                dir="E", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
+            )
+            V_from_E = tl.recv(
+                dir="E", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
+            )
+            m_new, ell_new, o_new = _partial_attention(
+                Q, K_from_E, V_from_E, S_q, S_kv_per_rank, h_q, d_head, tl,
+            )
+            m_combined = tl.maximum(m, m_new)
+            scale_old = tl.exp(m - m_combined)
+            scale_new = tl.exp(m_new - m_combined)
+            ell = ell * scale_old + ell_new * scale_new
+            o = o * scale_old + o_new * scale_new
+            m = m_combined
+
+        # Forward what we received for next step. ``None`` propagates: if no
+        # chunk arrived this step (out-of-bounds wave origin), there is
+        # nothing to forward next step in that direction.
+        to_send_east_K = K_from_W
+        to_send_east_V = V_from_W
+        to_send_west_K = K_from_E
+        to_send_west_V = V_from_E
+
+    # Final normalize: O := o / ℓ.
+    O_final = o / ell
+    tl.store(o_ptr, O_final)
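The wave schedule described in the module docstring above can be checked in isolation. What follows is a minimal sketch and not part of the commit: `simulate_fanout` is a hypothetical helper that moves chunk ids only (no tensors, no `tl` calls) along an open line of `n_ranks` ranks. It applies the same send and forward rules as the kernel and records the step at which each chunk arrives.

```python
from __future__ import annotations


def simulate_fanout(n_ranks: int) -> dict[int, dict[int, int]]:
    """Return arrivals[rank][origin] = step at which origin's chunk arrives.

    Hypothetical helper for illustration; it mirrors the send/forward rules
    of the kernel above but moves chunk ids instead of K/V tensors.
    """
    arrivals: dict[int, dict[int, int]] = {r: {} for r in range(n_ranks)}
    # Step-1 seed: every rank offers its own chunk in both directions.
    send_east: dict[int, int | None] = {r: r for r in range(n_ranks)}
    send_west: dict[int, int | None] = {r: r for r in range(n_ranks)}

    for step in range(1, n_ranks):
        recv_from_w: dict[int, int | None] = dict.fromkeys(range(n_ranks))
        recv_from_e: dict[int, int | None] = dict.fromkeys(range(n_ranks))
        for r in range(n_ranks):
            has_e, has_w = r < n_ranks - 1, r > 0
            if has_e and send_east[r] is not None:
                recv_from_w[r + 1] = send_east[r]  # lands on the east neighbor
            if has_w and send_west[r] is not None:
                recv_from_e[r - 1] = send_west[r]  # lands on the west neighbor
        for r in range(n_ranks):
            for origin in (recv_from_w[r], recv_from_e[r]):
                if origin is not None:
                    assert origin not in arrivals[r], "chunk delivered twice"
                    arrivals[r][origin] = step
        # Forward each chunk as received; None means the wave ended here.
        send_east, send_west = recv_from_w, recv_from_e
    return arrivals
```

Under these rules each rank receives every other rank's chunk once, at step `abs(rank - origin)`, which is the property the kernel's receive guards `(rank - step) >= 0` and `(rank + step) < n_ranks` rely on.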
@@ -0,0 +1,151 @@
+"""Mesh-native bidirectional AllReduce-mlo attention — decode (ADR-0059 Proposed).
+
+Every rank holds the full Q (replicated, small at ``S_q=1``) and 1/n_ranks
+of KV (sequence-sharded). Each rank computes its partial attention
+against own KV in ONE shot, then runs a bidirectional fan-out of the
+``(m, ℓ, o)`` triplet: the triplet originating at rank i reaches rank j at
+step ``|i - j|``. Every rank merges every other rank's triplet exactly
+once over ``n_ranks - 1`` steps, ending with the final answer replicated
+on every rank.
+
+Supersedes ADR-0056's closed-ring ``_attention_allreduce_mlo.py``. Both
+modules stay on disk during the transition; this one runs on the
+hardware's actual open-mesh wiring (no closed-ring SFR install required).
+
+Imported by ``milestone_gqa_llama70b`` (after the bench's Phase 2 switches
+its imports) and invoked through ``torch.launch(...)`` — not through
+``dist.all_reduce(...)``. See ADR-0056 Context for why this kernel is not
+backend-dispatched via ADR-0050's algorithm-module contract.
+"""
+from __future__ import annotations
+
+from kernbench.common.pe_commands import TensorHandle
+
+
+def _view(handle: TensorHandle, new_shape: tuple[int, ...]) -> TensorHandle:
+    """Reshape — metadata only, no command emitted (cf. ``tl.trans``)."""
+    return TensorHandle(
+        id=handle.id,
+        addr=handle.addr,
+        shape=new_shape,
+        dtype=handle.dtype,
+        nbytes=handle.nbytes,
+        data=handle.data,
+        space=handle.space,
+        pinned=handle.pinned,
+    )
+
+
+def attention_mesh_mlo_kernel(
+    q_ptr: int,
+    k_ptr: int,
+    v_ptr: int,
+    o_ptr: int,
+    S_q: int,
+    S_kv_per_rank: int,
+    h_q: int,
+    h_kv: int,
+    d_head: int,
+    n_ranks: int,
+    rank_axis: int = 0,
+    *,
+    tl,
+) -> None:
+    """Mesh-native bidirectional AllReduce-mlo — see module docstring.
+
+    ``rank_axis`` selects which program-id dimension carries the ring rank:
+        0 — single_user_* panels: rank == tl.program_id(axis=0) (PE id in cube).
+        1 — multi_user_* panels: ring is at the cube level. Only PE 0 in each
+            cube participates; the other 7 hold KV replicas but stay silent.
+    """
+    # For multi_user (rank_axis=1) only PE 0 in each cube runs the ring.
+    if rank_axis != 0 and tl.program_id(axis=0) != 0:
+        return
+    rank = tl.program_id(axis=rank_axis)
+    has_E = rank < n_ranks - 1
+    has_W = rank > 0
+
+    # Q is replicated on every rank — loaded once.
+    Q = tl.load(q_ptr, shape=(S_q, h_q * d_head), dtype="f16")
+
+    # Local KV chunk. KV is sequence-sharded and stays put on this rank for
+    # the entire fan-out — distinguishing decode from prefill (ADR-0059 D3)
+    # where KV circulates.
+    K = tl.load(k_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")
+    V = tl.load(v_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")
+
+    # ── One-shot local partial attention ──────────────────────────
+    K_2d_T = _view(K, (h_q * d_head, S_kv_per_rank))
+    V_2d = _view(V, (S_kv_per_rank, h_q * d_head))
+    scores = tl.dot(Q, K_2d_T)
+    m = tl.max(scores, axis=-1)
+    P = tl.softmax(scores, axis=-1)
+    scores_centered = scores - m
+    exp_scores = tl.exp(scores_centered)
+    ell = tl.sum(exp_scores, axis=-1)
+    o = tl.dot(P, V_2d)
+
+    # Seed bidirectional waves with own triplet (step-1 send).
+    to_send_east_m: TensorHandle | None = m
+    to_send_east_ell: TensorHandle | None = ell
+    to_send_east_o: TensorHandle | None = o
+    to_send_west_m: TensorHandle | None = m
+    to_send_west_ell: TensorHandle | None = ell
+    to_send_west_o: TensorHandle | None = o
+
+    # Bidirectional fan-out of (m, ℓ, o) triplets — n_ranks - 1 steps.
+    for step in range(1, n_ranks):
+        # Send eastbound triplet (own at step 1; forwarded at later steps).
+        if has_E and to_send_east_m is not None:
+            tl.send(dir="E", src=to_send_east_m)
+            tl.send(dir="E", src=to_send_east_ell)
+            tl.send(dir="E", src=to_send_east_o)
+        # Send westbound triplet.
+        if has_W and to_send_west_m is not None:
+            tl.send(dir="W", src=to_send_west_m)
+            tl.send(dir="W", src=to_send_west_ell)
+            tl.send(dir="W", src=to_send_west_o)
+
+        # Receive eastbound triplet from W (originated at rank - step).
+        m_from_W: TensorHandle | None = None
+        ell_from_W: TensorHandle | None = None
+        o_from_W: TensorHandle | None = None
+        if has_W and (rank - step) >= 0:
+            m_from_W = tl.recv(dir="W", shape=m.shape, dtype="f16")
+            ell_from_W = tl.recv(dir="W", shape=ell.shape, dtype="f16")
+            o_from_W = tl.recv(dir="W", shape=o.shape, dtype="f16")
+            m_combined = tl.maximum(m, m_from_W)
+            scale_old = tl.exp(m - m_combined)
+            scale_new = tl.exp(m_from_W - m_combined)
+            ell = ell * scale_old + ell_from_W * scale_new
+            o = o * scale_old + o_from_W * scale_new
+            m = m_combined
+
+        # Receive westbound triplet from E (originated at rank + step).
+        m_from_E: TensorHandle | None = None
+        ell_from_E: TensorHandle | None = None
+        o_from_E: TensorHandle | None = None
+        if has_E and (rank + step) < n_ranks:
+            m_from_E = tl.recv(dir="E", shape=m.shape, dtype="f16")
+            ell_from_E = tl.recv(dir="E", shape=ell.shape, dtype="f16")
+            o_from_E = tl.recv(dir="E", shape=o.shape, dtype="f16")
+            m_combined = tl.maximum(m, m_from_E)
+            scale_old = tl.exp(m - m_combined)
+            scale_new = tl.exp(m_from_E - m_combined)
+            ell = ell * scale_old + ell_from_E * scale_new
+            o = o * scale_old + o_from_E * scale_new
+            m = m_combined
+
+        # Forward the original received triplet (not the merged running state)
+        # so neighbors get the original wave. ``None`` propagates if nothing
+        # arrived this step.
+        to_send_east_m = m_from_W
+        to_send_east_ell = ell_from_W
+        to_send_east_o = o_from_W
+        to_send_west_m = m_from_E
+        to_send_west_ell = ell_from_E
+        to_send_west_o = o_from_E
+
+    # Final normalize: O := o / ℓ.
+    O_final = o / ell
+    tl.store(o_ptr, O_final)
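The `(m, ℓ, o)` merge used in both kernels can be illustrated outside the simulator. The sketch below is a pure-Python illustration for a single query row (the decode case, `S_q = 1`), not the committed code: `partial`, `merge`, `sharded_attention` and `reference_attention` are hypothetical names, and the KV length is assumed to divide evenly by `n_ranks`. The sketch keeps `o` unnormalized until the final division, which is the form the merge formula assumes.

```python
import math


def partial(q, keys, values):
    """Unnormalized partial attention of one query over one KV shard."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                                # running max
    weights = [math.exp(s - m) for s in scores]    # exp(scores - m)
    ell = sum(weights)                             # running denominator
    o = [sum(w * v[j] for w, v in zip(weights, values))
         for j in range(len(values[0]))]           # unnormalized output
    return m, ell, o


def merge(a, b):
    """Combine two (m, ell, o) triplets with the online-softmax recurrence."""
    m_a, ell_a, o_a = a
    m_b, ell_b, o_b = b
    m = max(m_a, m_b)
    scale_a, scale_b = math.exp(m_a - m), math.exp(m_b - m)
    ell = ell_a * scale_a + ell_b * scale_b
    o = [x * scale_a + y * scale_b for x, y in zip(o_a, o_b)]
    return m, ell, o


def sharded_attention(q, keys, values, n_ranks):
    """Fold per-shard triplets, then normalize once: O = o / ell."""
    size = len(keys) // n_ranks  # assumption: even sequence sharding
    state = None
    for r in range(n_ranks):
        lo, hi = r * size, (r + 1) * size
        shard = partial(q, keys[lo:hi], values[lo:hi])
        state = shard if state is None else merge(state, shard)
    _, ell, o = state
    return [x / ell for x in o]


def reference_attention(q, keys, values):
    """Plain softmax attention over the full, unsharded KV."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(len(values[0]))]
```

Because the merge is associative and commutative, the order in which triplets arrive from the east and west waves does not affect the final result up to floating-point rounding.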
@@ -0,0 +1,568 @@
+"""milestone-1h-gemm bench: GEMM evaluation harness (sweep + figures).
+
+Self-contained milestone bench (ADR-0054). Holds the shape×variant sweep
+and the figure renderers; the ``run(torch)`` entry at the bottom runs the
+sweep (or reuses the committed JSON when ``MILESTONE_FAST=1``) and writes
+every figure into ``benches/1H_milestone_output/gemm/``.
+
+This is the single home for the GEMM eval logic: the figure tests import a
+thin re-export shim (``tests/gemm/_gemm_plot_helpers.py``), as does the
+``scripts/gemm_sweep.py`` wrapper.
+
+The sweep drives ``matmul-composite`` across shapes×variants through the
+same ``run_bench`` path the CLI uses, harvests ``result.engine.op_log``,
+and writes the sweep JSON. The renderers read that JSON and emit matplotlib
+PNGs. No simulation in the renderers — they are fast.
+
+Chart set (mirrors the GEMM MAC slides in scripts/build_overview_slides.py):
+  - stage breakdown (load_ref operand staging)
+  - MAC utilization — measured (load_ref)
+  - MAC utilization — theoretical vs measured (load_ref)
+"""
+from __future__ import annotations
+
+import json
+import os
+import sys
+import time
+from pathlib import Path
+
+from kernbench.benches.registry import bench
+from kernbench.policy.placement.dp import DPPolicy
+
+ROOT = Path(__file__).resolve().parents[3]
+DEFAULT_SWEEP_JSON = ROOT / "docs" / "diagrams" / "gemm_sweep.json"
+DEFAULT_PLOTS_DIR = ROOT / "docs" / "diagrams" / "gemm_plots"
+_OUTPUT_DIR = Path(__file__).resolve().parent / "1H_milestone_output" / "gemm"
+
+# ── sweep configuration ────────────────────────────────────────────────
+
+# Default sweep covering under-tile, single-tile, multi-tile, and asymmetric
+# regimes. Each entry is "MxKxN" or a single int (square M=K=N).
+# Override via env: SWEEP_SHAPES="16,32,16x2048x16,..."
+DEFAULT_SHAPES = [
+    "32x32x32",       # 1 tile, K=32 < TILE_K=64 → under-tile in K
+    "32x64x32",       # 1 tile, exact single-tile fit
+    "32x128x32",      # 2 tiles, aligned
+    "32x128x128",     # 8 tiles, aligned
+    "32x3072x32",     # 48 tiles, all K-axis (tall-skinny)
+    "8x128x128",      # 8 tiles, but M=8 < TILE_M=32 → MAC array under-fed
+    "128x8x128",      # 16 tiles, but K=8 < TILE_K=64 → MAC array under-fed
+    "512",            # 2048 tiles, fully aligned — "well-pipelined" reference
+]
+
+# Operand-staging variants exercised per shape.
+VARIANTS = ["ref_ref", "load_ref", "load_load"]
+
+# Engines whose timings we collect (component_id suffix match).
+ENGINES = ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_math"]
+
+# Per-stage breakdown labels (StageType enum names from pe_types.py).
+STAGES = ["DMA_READ", "DMA_WRITE", "FETCH", "STORE", "GEMM", "MATH"]
+
+# Scheduler tile sizes (mirror of PeSchedulerComponent.TILE_M/K/N).
+TILE_M, TILE_K, TILE_N = 32, 64, 32
+
+
+def _ceil(a: int, b: int) -> int:
+    return (a + b - 1) // b
+
+
+def _engine_wall_ns(records, suffix: str) -> float:
+    """Wall-clock interval the engine was active (union of overlapping ops)."""
+    intervals = [(r.t_start, r.t_end) for r in records
+                 if r.component_id.endswith("." + suffix)]
+    if not intervals:
+        return 0.0
+    intervals.sort()
+    merged_end = intervals[0][1]
+    merged_start = intervals[0][0]
+    total = 0.0
+    for s, e in intervals[1:]:
+        if s <= merged_end:
+            merged_end = max(merged_end, e)
+        else:
+            total += merged_end - merged_start
+            merged_start, merged_end = s, e
+    total += merged_end - merged_start
+    return total
+
+
+def _engine_occupancy_ns(records, suffix: str) -> float:
+    return sum(r.t_end - r.t_start for r in records
+               if r.component_id.endswith("." + suffix))
+
+
+def _engine_count(records, suffix: str) -> int:
+    return sum(1 for r in records if r.component_id.endswith("." + suffix))
+
+
+def _stage_occupancy_ns(records, stage_type: str) -> float:
+    return sum(
+        r.t_end - r.t_start
+        for r in records
+        if r.params.get("stage_type") == stage_type
+    )
+
+
+def _stage_wall_ns(records, stage_type: str) -> float:
+    """Interval-union wall-clock for records whose stage_type matches."""
+    intervals = sorted(
+        (r.t_start, r.t_end) for r in records
+        if r.params.get("stage_type") == stage_type
+    )
+    if not intervals:
+        return 0.0
+    total = 0.0
+    cs, ce = intervals[0]
+    for s, e in intervals[1:]:
+        if s <= ce:
+            ce = max(ce, e)
+        else:
+            total += ce - cs
+            cs, ce = s, e
+    total += ce - cs
+    return total
+
+
+def _stage_count(records, stage_type: str) -> int:
+    return sum(1 for r in records if r.params.get("stage_type") == stage_type)
+
+
+def _run_one(M: int, K: int, N: int, topology: str, variant: str = "ref_ref") -> dict:
+    os.environ["MATMUL_M"] = str(M)
+    os.environ["MATMUL_K"] = str(K)
+    os.environ["MATMUL_N"] = str(N)
+    os.environ["MATMUL_VARIANT"] = variant
+
+    # Late imports so env vars are read by matmul_composite at module load.
+    # Force re-import to pick up new env values.
+    for mod_name in [m for m in list(sys.modules)
+                     if m.startswith("kernbench.benches.matmul_composite")]:
+        del sys.modules[mod_name]
+
+    from kernbench.benches.registry import resolve as resolve_bench
+    from kernbench.runtime_api.bench_runner import run_bench
+    from kernbench.runtime_api.types import resolve_device
+    from kernbench.sim_engine.engine import GraphEngine
+    from kernbench.topology.builder import resolve_topology
+
+    topo = resolve_topology(topology)
+    bench_fn = resolve_bench("matmul-composite").run
+    device = resolve_device(None)
+
+    t0 = time.time()
+    result = run_bench(
+        topology=topo, bench_fn=bench_fn, device=device,
+        engine_factory=lambda t, d: GraphEngine(
+            getattr(t, "topology_obj", t), enable_data=True,
+        ),
+    )
+    wall = time.time() - t0
+
+    op_log = result.engine.op_log
+    if not result.completion.ok:
+        raise RuntimeError(f"bench failed at M={M},K={K},N={N}: {result.completion}")
+
+    # Bytes touched at f16 (2 B): full A + full B + full out (each operand
+    # streamed once through HBM by the composite plan).
+    bytes_total = (M * K + K * N + M * N) * 2
+    row = {
+        "M": M, "K": K, "N": N,
+        "variant": variant,
+        "flops": 2 * M * K * N,
+        "bytes_hbm": bytes_total,
+        "arith_intensity": (2 * M * K * N) / bytes_total,  # flops/byte
+        "tile_count_expected": _ceil(M, TILE_M) * _ceil(N, TILE_N) * _ceil(K, TILE_K),
+        "sim_wall_clock_s": round(wall, 3),
+        "engines": {},
+    }
+    for eng in ENGINES:
+        row["engines"][eng] = {
+            "occupancy_ns": _engine_occupancy_ns(op_log, eng),
+            "wall_ns":      _engine_wall_ns(op_log, eng),
+            "record_count": _engine_count(op_log, eng),
+        }
+    row["stages"] = {}
+    for stage in STAGES:
+        row["stages"][stage] = {
+            "occupancy_ns": _stage_occupancy_ns(op_log, stage),
+            "wall_ns":      _stage_wall_ns(op_log, stage),
+            "record_count": _stage_count(op_log, stage),
+        }
+    # Kernel-window wall-clock = max t_end - min t_start over PE engine records.
+    pe_records = [r for r in op_log
+                  if any(r.component_id.endswith("." + e) for e in ENGINES)]
+    if pe_records:
+        row["pe_window_ns"] = max(r.t_end for r in pe_records) \
+                              - min(r.t_start for r in pe_records)
+    else:
+        row["pe_window_ns"] = 0.0
+    stage_records = [r for r in op_log
+                     if r.params.get("stage_type") in STAGES]
+    if stage_records:
+        row["composite_window_ns"] = max(r.t_end for r in stage_records) \
+                                     - min(r.t_start for r in stage_records)
+    else:
+        row["composite_window_ns"] = 0.0
+    return row
+
+
+def _parse_shapes(raw) -> list[tuple[int, int, int]]:
+    shapes: list[tuple[int, int, int]] = []
+    for s in raw:
+        s = s.strip()
+        if not s:
+            continue
+        if "x" in s.lower():
+            parts = s.lower().split("x")
+            shapes.append((int(parts[0]), int(parts[1]), int(parts[2])))
+        else:
+            v = int(s)
+            shapes.append((v, v, v))
+    return shapes
+
+
+def run_sweep(out_json: Path | str = DEFAULT_SWEEP_JSON) -> Path:
+    """Drive matmul-composite across shapes×variants; write the sweep JSON.
+
+    Honors ``SWEEP_SHAPES`` / ``SWEEP_TOPOLOGY`` env overrides (same as the
+    historical ``scripts/gemm_sweep.py``). Returns the JSON path written.
+    """
+    shapes_env = os.environ.get("SWEEP_SHAPES")
+    raw = (shapes_env.split(",") if shapes_env else DEFAULT_SHAPES)
+    shapes = _parse_shapes(raw)
+    topology = os.environ.get("SWEEP_TOPOLOGY", "topology.yaml")
+
+    rows = []
+    for M, K, N in shapes:
+        for variant in VARIANTS:
+            print(f"[sweep] M={M} K={K} N={N} variant={variant} ...", flush=True)
+            row = _run_one(M, K, N, topology, variant=variant)
+            rows.append(row)
+            eng_dma = row["engines"]["pe_dma"]
+            eng_gem = row["engines"]["pe_gemm"]
+            print(f"   tiles={row['tile_count_expected']:>6}  "
+                  f"pe_window={row['pe_window_ns']:8.1f}ns  "
+                  f"dma_occ={eng_dma['occupancy_ns']:9.1f}  "
+                  f"gemm_occ={eng_gem['occupancy_ns']:8.1f}  "
+                  f"(sim {row['sim_wall_clock_s']:.1f}s)")
+
+    out_json = Path(out_json)
+    out_json.parent.mkdir(parents=True, exist_ok=True)
+    out_json.write_text(json.dumps({
+        "tile_sizes": {"M": TILE_M, "K": TILE_K, "N": TILE_N},
+        "engines": ENGINES,
+        "stages": STAGES,
+        "variants": VARIANTS,
+        "rows": rows,
+    }, indent=2))
+    print(f"\n[sweep] wrote {out_json}")
+    return out_json
+
+
+# ── figure rendering ───────────────────────────────────────────────────
+
+# Shapes excluded from the figures (mirrors build_overview_slides).
+EXCLUDED_SHAPES = {(512, 512, 512)}
+
+# Stage bars shown (raw op_log stage_type keys) + display names + colors.
+STAGE_KEYS = ["DMA_READ", "FETCH", "GEMM", "DMA_WRITE"]
+STAGE_DISPLAY = {
+    "DMA_READ":  "DMA in",
+    "FETCH":     "Fetch",
+    "GEMM":      "GEMM",
+    "DMA_WRITE": "DMA out",
+}
+STAGE_COLORS = {
+    "DMA_READ":  "#3B82F6",
+    "FETCH":     "#10B981",
+    "GEMM":      "#F59E0B",
+    "DMA_WRITE": "#A855F7",
+}
+
+# MAC-utilization model constants (mirror build_overview_slides).
+_HBM_GBS = 256.0
+_BPE = 2
+_T_STAGE = 16.0
+_D_STAGES = 3
+
+_PLOT_VARIANT = "load_ref"
+
+
+def _load_sweep_data(sweep_json: Path | str = DEFAULT_SWEEP_JSON) -> dict:
+    sweep_json = Path(sweep_json)
+    if not sweep_json.exists():
+        return {"rows": []}
+    data = json.loads(sweep_json.read_text())
+    data["rows"] = [
+        r for r in data.get("rows", [])
+        if (r["M"], r["K"], r["N"]) not in EXCLUDED_SHAPES
+    ]
+    return data
+
+
+def _shape_label(r: dict) -> str:
+    if r["M"] == r["K"] == r["N"]:
+        return f"M=K=N={r['M']}"
+    return f"M={r['M']} K={r['K']} N={r['N']}"
+
+
+def _under_tile(M, K, N, tile_M, tile_K, tile_N) -> bool:
+    return M < tile_M or K < tile_K or N < tile_N
+
+
+def _xtick_labels(shape_labels, tile_counts, flagged) -> list[str]:
+    out = []
+    for lbl, tc, fl in zip(shape_labels, tile_counts, flagged):
+        s = f"{lbl}\n({tc} tiles)"
+        if fl:
+            s += " *"
+        out.append(s)
+    return out
+
+
+def _grouped_bar_png(
+    out_name: str, *, out_dir: Path, title: str, subtitle: str | None,
+    shape_labels, tile_counts, flagged, series: dict, colors: dict,
+    y_label: str, threshold: float | None = None, footnote: str | None = None,
+) -> str:
+    """Render one grouped-bar chart to out_dir/out_name; return the path."""
+    import matplotlib.pyplot as plt
+    import numpy as np
+
+    n_groups = len(shape_labels)
+    n_series = max(1, len(series))
+    x = np.arange(n_groups)
+    width = 0.8 / n_series
+
+    fig, ax = plt.subplots(figsize=(11, 6))
+    for i, (name, vals) in enumerate(series.items()):
+        offset = (i - (n_series - 1) / 2) * width
+        ax.bar(x + offset, vals, width, label=name, color=colors.get(name))
+
+    ax.set_xticks(x)
+    ax.set_xticklabels(
+        _xtick_labels(shape_labels, tile_counts, flagged), fontsize=8,
+    )
+    ax.set_ylabel(y_label)
+    ax.set_title(title, fontsize=13, fontweight="bold")
+    if subtitle:
+        ax.text(0.5, 1.01, subtitle, transform=ax.transAxes, ha="center",
+                va="bottom", fontsize=8, color="#475569")
+    if threshold is not None:
+        ax.axhline(threshold, ls="--", color="gray", lw=1.0)
+    ax.legend(fontsize=8, loc="upper right")
+    ax.grid(True, axis="y", alpha=0.3)
+
+    caption = "* = under-tile shape (M<TILE_M, K<TILE_K, or N<TILE_N)"
+    if footnote:
+        caption = footnote + "\n" + caption
+    fig.text(0.5, 0.01, caption, ha="center", fontsize=7, color="gray",
+             wrap=True)
+
+    fig.tight_layout(rect=(0, 0.05, 1, 1))
+    out_dir = Path(out_dir)
+    out_dir.mkdir(parents=True, exist_ok=True)
+    out = out_dir / out_name
+    fig.savefig(out, dpi=120)
+    plt.close(fig)
+    return str(out)
+
+
+def emit_stage_breakdown(
+    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
+    out_dir: Path | str = DEFAULT_PLOTS_DIR,
+) -> str | None:
+    """Per-stage engine wall-clock per shape (load_ref operand staging)."""
+    data = _load_sweep_data(sweep_json)
+    rows = [r for r in data["rows"] if r.get("variant") == _PLOT_VARIANT]
+    if not rows:
+        return None
+    tile = data["tile_sizes"]
+    shape_labels = [_shape_label(r) for r in rows]
+    flagged = [_under_tile(r["M"], r["K"], r["N"], tile["M"], tile["K"], tile["N"])
+               for r in rows]
+    tile_counts = [r["tile_count_expected"] for r in rows]
+    series = {
+        STAGE_DISPLAY[s]: [r.get("stages", {}).get(s, {}).get("wall_ns", 0.0)
+                           for r in rows]
+        for s in STAGE_KEYS
+    }
+    colors = {STAGE_DISPLAY[s]: STAGE_COLORS[s] for s in STAGE_KEYS}
+    return _grouped_bar_png(
+        "gemm_stage_breakdown.png", out_dir=Path(out_dir),
+        title="GEMM stage breakdown",
+        subtitle=(f"Per-stage engine wall-clock (DMA in / Fetch / GEMM / "
+                  f"DMA out), {_PLOT_VARIANT} staging. "
+                  f"Tile {tile['M']}x{tile['K']}x{tile['N']}."),
+        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
+        series=series, colors=colors, y_label="ns",
+        footnote="Bars = engine wall-clock interval (merged overlaps).",
+    )
+
+
+def emit_mac_utilization_measured(
+    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
+    out_dir: Path | str = DEFAULT_PLOTS_DIR,
+) -> str | None:
+    """GEMM util % and useful pipeline-eff % (analytical model, load_ref)."""
+    data = _load_sweep_data(sweep_json)
+    rows = data["rows"]
+    if not rows:
+        return None
+    tile = data["tile_sizes"]
+    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
+    tile_flops = 2 * TILE_M * TILE_K * TILE_N
+    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
+    head_ns = (_D_STAGES - 1) * _T_STAGE
+
+    by_shape = {(r["M"], r["K"], r["N"]): r
+                for r in rows if r["variant"] == _PLOT_VARIANT}
+    shapes = list(by_shape)
+    if not shapes:
+        return None
+    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
+    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
+    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
+
+    gemm_util, useful_eff = [], []
+    for k in shapes:
+        r = by_shape[k]
+        M, K, N = r["M"], r["K"], r["N"]
+        useful = 2 * M * K * N
+        tiles = r["tile_count_expected"]
+        gu = useful / (tile_flops * tiles) * 100
+        gemm_util.append(gu)
+        m_tiles = (M + TILE_M - 1) // TILE_M
+        n_tiles = (N + TILE_N - 1) // TILE_N
+        n_mn = m_tiles * n_tiles
+        compute_total = tiles * _T_STAGE
+        wall = head_ns + compute_total + max(0, n_mn - 1) * dma_w_per_pair
+        ueff = (compute_total * (gu / 100.0) / wall) * 100 if wall > 0 else 0.0
+        useful_eff.append(ueff)
+
+    series = {"GEMM util %": gemm_util, "Useful eff %": useful_eff}
+    colors = {"GEMM util %": "#10B981", "Useful eff %": "#F59E0B"}
+    return _grouped_bar_png(
+        "gemm_mac_utilization_measured.png", out_dir=Path(out_dir),
+        title="GEMM MAC utilization — load_ref",
+        subtitle=("GEMM util = useful FLOPs / (tile FLOPs x tiles); "
+                  "Useful eff = GEMM util x ideal pipeline efficiency."),
+        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
+        series=series, colors=colors, y_label="%", threshold=100.0,
+        footnote="Theoretical ideal-pipeline model (not simulator data).",
+    )
+
+
+def emit_mac_utilization_theoretical_vs_measured(
+    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
+    out_dir: Path | str = DEFAULT_PLOTS_DIR,
+) -> str | None:
+    """Theoretical vs simulator-measured GEMM util / useful eff (load_ref)."""
+    data = _load_sweep_data(sweep_json)
+    rows = data["rows"]
+    if not rows:
+        return None
+    tile = data["tile_sizes"]
+    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
+    tile_flops = 2 * TILE_M * TILE_K * TILE_N
+    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
+    head_ns = (_D_STAGES - 1) * _T_STAGE
+    peak_per_ns = tile_flops / _T_STAGE
+
+    by_shape = {(r["M"], r["K"], r["N"]): r
+                for r in rows if r["variant"] == _PLOT_VARIANT}
+    shapes = list(by_shape)
+    if not shapes:
+        return None
+    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
+    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
+    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
+
+    gu_t, gu_m, eff_t, eff_m = [], [], [], []
+    for k in shapes:
+        r = by_shape[k]
+        M, K, N = r["M"], r["K"], r["N"]
+        useful = 2 * M * K * N
+        tiles = r["tile_count_expected"]
+        gut = useful / (tile_flops * tiles)
+        gu_t.append(gut * 100)
+        rec = r.get("stages", {}).get("GEMM", {}).get("record_count", 0) or tiles
+        gu_m.append((useful / (tile_flops * rec) * 100) if rec else 0.0)
+        m_tiles = (M + TILE_M - 1) // TILE_M
+        n_tiles = (N + TILE_N - 1) // TILE_N
+        n_mn = m_tiles * n_tiles
+        compute_total = tiles * _T_STAGE
+        wall_t = head_ns + compute_total + max(0, n_mn - 1) * dma_w_per_pair
+        eff_t.append((compute_total * gut / wall_t * 100) if wall_t > 0 else 0.0)
+        cw = r.get("composite_window_ns", 0.0) or 0.0
+        eff_m.append((useful / cw / peak_per_ns * 100) if cw > 0 else 0.0)
+
+    series = {
+        "GEMM util % (theoretical)": gu_t,
+        "GEMM util % (measured)":    gu_m,
+        "Theoretical eff %":         eff_t,
+        "Measured eff %":            eff_m,
+    }
+    colors = {
+        "GEMM util % (theoretical)": "#10B981",
+        "GEMM util % (measured)":    "#6EE7B7",
+        "Theoretical eff %":         "#F59E0B",
+        "Measured eff %":            "#3B82F6",
+    }
+    return _grouped_bar_png(
+        "gemm_mac_utilization_theoretical_vs_measured.png", out_dir=Path(out_dir),
+        title="GEMM MAC utilization — theoretical vs measured (load_ref)",
+        subtitle=("Theoretical model vs simulator op_log; agreement "
+                  "validates the analytical pipeline model."),
+        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
+        series=series, colors=colors, y_label="%", threshold=100.0,
+    )
+
+
+def emit_all_gemm_plots(
+    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
+    out_dir: Path | str = DEFAULT_PLOTS_DIR,
+) -> list[str]:
+    """Render every GEMM figure that has data; return the paths written."""
+    paths = []
+    for fn in (emit_stage_breakdown,
+               emit_mac_utilization_measured,
+               emit_mac_utilization_theoretical_vs_measured):
+        p = fn(sweep_json, out_dir)
+        if p:
+            paths.append(p)
+    return paths
+
+
+# ── bench entry ────────────────────────────────────────────────────────
+
+
+@bench(
+    name="milestone-1h-gemm",
+    description="1H milestone: regenerate all GEMM results + figures.",
+)
+def run(torch) -> None:
+    """Run the GEMM sweep (or reuse committed JSON) and render every figure.
+
+    ``MILESTONE_FAST=1`` reuses the committed ``DEFAULT_SWEEP_JSON`` (seconds);
+    otherwise the full sweep runs into ``out_dir/gemm_sweep.json`` (minutes).
+    The sweep drives its own engines, so a sentinel tensor is submitted at the
+    end to satisfy the run_bench contract (ADR-0045 D4).
+    """
+    _OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+    fast = os.environ.get("MILESTONE_FAST", "") not in ("", "0")
+    if fast:
+        sweep_json = DEFAULT_SWEEP_JSON
+    else:
+        sweep_json = run_sweep(out_json=_OUTPUT_DIR / "gemm_sweep.json")
+    paths = emit_all_gemm_plots(sweep_json=sweep_json, out_dir=_OUTPUT_DIR)
+    print(f"  milestone-1h-gemm: {len(paths)} figures -> {_OUTPUT_DIR} "
+          f"(fast={fast})")
+
+    torch.zeros(
+        (1, 1), dtype="f16",
+        dp=DPPolicy(cube="row_wise", pe="replicate", num_cubes=1, num_pes=1),
+        name="milestone_gemm_sentinel",
+    )
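The analytical model behind the two utilization plots can be tried in isolation. The sketch below is a minimal, standalone restatement and not the module's code: the constants are copied from the values this diff removes from the test shim, the function name `theoretical_util` is hypothetical, and the tile count is assumed to be the product of the per-axis ceil-divided tile counts (the real `tile_count_expected` comes from the sweep JSON).

```python
import math

# Constants as they appear in the lines removed from the test shim in this diff.
HBM_BYTES_PER_NS = 256.0   # _HBM_GBS; GB/s read as bytes per ns (assumed unit)
BYTES_PER_ELEM = 2         # _BPE
T_STAGE_NS = 16.0          # _T_STAGE
PIPELINE_STAGES = 3        # _D_STAGES


def theoretical_util(M, K, N, tile_m, tile_k, tile_n):
    """Return (GEMM util %, useful pipeline eff %) for one shape."""
    m_tiles = math.ceil(M / tile_m)
    k_tiles = math.ceil(K / tile_k)
    n_tiles = math.ceil(N / tile_n)
    # Assumption: stands in for the sweep's tile_count_expected.
    tiles = m_tiles * k_tiles * n_tiles

    tile_flops = 2 * tile_m * tile_k * tile_n
    gemm_util = (2 * M * K * N) / (tile_flops * tiles)

    compute_total = tiles * T_STAGE_NS
    head_ns = (PIPELINE_STAGES - 1) * T_STAGE_NS  # pipeline fill
    dma_w_per_pair = (tile_m * tile_n * BYTES_PER_ELEM) / HBM_BYTES_PER_NS
    # One output write-back per (m, n) tile pair, all but the first exposed.
    wall = head_ns + compute_total + max(0, m_tiles * n_tiles - 1) * dma_w_per_pair
    useful_eff = compute_total * gemm_util / wall
    return gemm_util * 100.0, useful_eff * 100.0


if __name__ == "__main__":
    # A shape that exactly fills one tile, and one that half-fills it.
    print(theoretical_util(32, 32, 32, 32, 32, 32))
    print(theoretical_util(16, 32, 32, 32, 32, 32))
```

With a single tile the pipeline-fill term dominates the wall time, which is why useful efficiency stays well below GEMM utilization for small shapes.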
@@ -609,6 +609,7 @@ class RuntimeContext:
        kernel_fn: Any,
        *args: Any,
        _defer_wait: bool = False,
+        _auto_dim_remap: bool = True,
        **kwargs: Any,
    ) -> RequestHandle:
        """Register and launch a kernel (like a fused torch op).
@@ -700,21 +701,36 @@ class RuntimeContext:
                return t.shape
            # ADR-0026: DPPolicy no longer crosses SIP boundaries; cube + PE
            # are the only axes that shrink the local shape.
+            # Mirror the tensor allocator's precedence (context.py L471-484):
+            # DPPolicy.num_cubes overrides the topology's cube count when set.
+            # Without this, multi_user panels at validation scale
+            # (DPPolicy.num_cubes=4) get sharded as if the topology's full
+            # cube count (16) applied — see test_launch_dim_translation.py.
            if dp.cube != "replicate":
+                eff_num_cubes = (
+                    dp.num_cubes if dp.num_cubes is not None else self._num_cubes
+                )
                if dp.cube == "column_wise":
-                    K = K // self._num_cubes
+                    K = K // eff_num_cubes
                elif dp.cube == "row_wise":
-                    M = M // self._num_cubes
+                    M = M // eff_num_cubes
            if len(t.shape) < 2:
                return (K,)
            return (M, K)

+        # Auto-dim-remap (opt-out via _auto_dim_remap=False). Legacy
+        # kernels (e.g. va_offset bench) pass global dims as scalars and
+        # rely on launch to rewrite them to local. Mesh attention kernels
+        # already receive cube-local dims (S_kv_per_rank, d_head, …) and
+        # opt out: otherwise the scalar d_head=64 matches the K tensor's global
+        # row dim (64) and gets rewritten. See test_launch_dim_translation.py.
        dim_map: dict[int, int] = {}  # global_dim → local_dim
-        for t in tensor_args:
-            local = _compute_local_shape(t)
-            for g, l in zip(t.shape if len(t.shape) >= 2 else (1, t.shape[0]), local if len(local) >= 2 else (1, local[0])):
-                if g != l:
-                    dim_map[g] = l
+        if _auto_dim_remap:
+            for t in tensor_args:
+                local = _compute_local_shape(t)
+                for g, loc in zip(t.shape if len(t.shape) >= 2 else (1, t.shape[0]), local if len(local) >= 2 else (1, local[0])):
+                    if g != loc:
+                        dim_map[g] = loc

        # Per-SIP kernel launch: each SIP gets TensorArgs with local va_base
        last_handle = None
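The two changes in this hunk interact, so a small standalone sketch may help. It is not the `RuntimeContext` code: `Policy` is a hypothetical stand-in for `DPPolicy`, only the cube axis of 2-D shapes is modelled, and `remap_scalars` assumes that launch rewrites any scalar argument equal to a key of `dim_map`, which is how the comment above describes the legacy behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Policy:
    """Hypothetical stand-in for DPPolicy (cube axis only)."""
    cube: str = "replicate"
    num_cubes: Optional[int] = None


def local_shape(shape, dp, topo_num_cubes):
    """Cube-local shape; dp.num_cubes overrides the topology count when set."""
    M, K = shape
    if dp.cube != "replicate":
        eff = dp.num_cubes if dp.num_cubes is not None else topo_num_cubes
        if dp.cube == "column_wise":
            K //= eff
        elif dp.cube == "row_wise":
            M //= eff
    return (M, K)


def build_dim_map(tensors, topo_num_cubes):
    """Map each global dim that shrinks to its local value."""
    dim_map = {}
    for shape, dp in tensors:
        for g, loc in zip(shape, local_shape(shape, dp, topo_num_cubes)):
            if g != loc:
                dim_map[g] = loc
    return dim_map


def remap_scalars(scalars, dim_map):
    """Assumed legacy behavior: rewrite scalars that match a global dim."""
    return [dim_map.get(s, s) for s in scalars]


if __name__ == "__main__":
    # K tensor at validation scale: (S_kv_per_rank * 4, d_head) = (64, 64).
    k = ((64, 64), Policy(cube="row_wise", num_cubes=4))
    dim_map = build_dim_map([k], topo_num_cubes=16)
    print(dim_map)
    # Scalars: S_q, S_kv_per_rank, h_q, h_kv, d_head, n_ranks.
    print(remap_scalars([16, 16, 1, 1, 64, 4], dim_map))
```

In this sketch the K tensor's row count of 64 shrinks to 16, so a scalar `d_head` of 64 would also be rewritten to 16; skipping the remap leaves the already-local scalars untouched.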
@@ -25,6 +25,37 @@ class DataExecutor:
    def __init__(self, op_log: list[OpRecord], store: MemoryStore) -> None:
        self._op_log = op_log
        self.store = store
+        # Per-slot time-ordered shape-keyed history. Populated on every
+        # ipcq_copy WRITE; consulted on reads that find a shape-mismatched
+        # value in MemoryStore (the slot was wrapped by a later inbound
+        # before this read's Phase 2 turn). Required because Phase 1 cannot
+        # snapshot math-output sources at outbound time (math executes only
+        # in Phase 2), so token.data is None and slot wraps lose the recv-
+        # time value. See test_attention_mesh_decode_diag (ADR-0059 mesh).
+        self._slot_history: dict[tuple[str, int], list[tuple[float, Any]]] = {}
+
+    def _resolve_read(
+        self, space: str, addr: int,
+        shape: tuple[int, ...] | None, dtype: str | None,
+        t_at_or_before: float,
+    ) -> Any:
+        """Read (space, addr) with expected shape. On KeyError or shape
+        mismatch in MemoryStore, fall back to ``_slot_history`` for the
+        most recent shape-matching entry with t_write <= t_at_or_before.
+        Returns None when no match is found."""
+        try:
+            return self.store.read(space, addr, shape=shape, dtype=dtype)
+        except (KeyError, ValueError):
+            pass
+        hist = self._slot_history.get((space, addr))
+        if hist is None:
+            return None
+        for t_w, val in reversed(hist):
+            if t_w > t_at_or_before:
+                continue
+            if shape is None or getattr(val, "shape", None) == shape:
+                return val
+        return None

    # Ordering priority within the same t_start: memory copies must run
    # before math/gemm so that slot data is populated before a consumer
@@ -87,14 +118,23 @@ class DataExecutor:
            # only get populated by Phase 2's math replay).
            data = p.get("snapshot")
            if data is None:
-                try:
-                    data = self.store.read(
-                        src_space, src_addr,
-                        shape=p.get("shape"), dtype=p.get("dtype"),
-                    )
-                except KeyError:
+                data = self._resolve_read(
+                    src_space, src_addr,
+                    p.get("shape"), p.get("dtype"), op.t_start,
+                )
+                if data is None:
                    return
            self.store.write(dst_space, dst_addr, data)
+            # Record this write in slot history so a later forwarded read
+            # at src=dst_addr (a different ipcq_copy whose src is this slot)
+            # can recover by shape even after the slot has been wrapped.
+            if op.op_name == "ipcq_copy":
+                self._slot_history.setdefault(
+                    (dst_space, dst_addr), [],
+                ).append((
+                    op.t_start,
+                    data.copy() if hasattr(data, "copy") else data,
+                ))

    def _execute_gemm(self, op: OpRecord) -> None:
        """Execute GEMM: out = a @ b."""
@@ -110,10 +150,16 @@ class DataExecutor:
        dtype_in = p.get("dtype_in", "f16")
        dtype_out = p.get("dtype_out", dtype_in)

-        a = self.store.read(src_a_space, p["src_a_addr"],
-                            shape=p.get("shape_a"), dtype=dtype_in)
-        b = self.store.read(src_b_space, p["src_b_addr"],
-                            shape=p.get("shape_b"), dtype=dtype_in)
+        a = self._resolve_read(src_a_space, p["src_a_addr"],
+                               p.get("shape_a"), dtype_in, op.t_start)
+        if a is None:
+            a = self.store.read(src_a_space, p["src_a_addr"],
+                                shape=p.get("shape_a"), dtype=dtype_in)
+        b = self._resolve_read(src_b_space, p["src_b_addr"],
+                               p.get("shape_b"), dtype_in, op.t_start)
+        if b is None:
+            b = self.store.read(src_b_space, p["src_b_addr"],
+                                shape=p.get("shape_b"), dtype=dtype_in)

        # Compute in higher precision if specified
        dtype_acc = p.get("dtype_acc", "f32")
@@ -150,8 +196,11 @@ class DataExecutor:
        ):
            if snap is not None:
                inputs.append(snap)
-            else:
-                inputs.append(self.store.read(space, addr, shape=shape, dtype=idtype))
+                continue
+            resolved = self._resolve_read(space, addr, shape, idtype, op.t_start)
+            if resolved is None:
+                resolved = self.store.read(space, addr, shape=shape, dtype=idtype)
+            inputs.append(resolved)

        result = _compute_math(math_op, inputs, p.get("axis"))
        if result is not None:
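The fallback lookup in `_resolve_read` can be exercised on its own. The sketch below is a simplified illustration under stated assumptions, not the `DataExecutor` code: `Val` is a hypothetical value type that carries only a shape and a tag, and the history list is assumed to be appended in non-decreasing write time.

```python
class Val:
    """Hypothetical payload: just a shape plus a tag to tell values apart."""

    def __init__(self, shape, tag):
        self.shape = shape
        self.tag = tag


def resolve_from_history(history, shape, t_at_or_before):
    """Newest entry written at or before the read time whose shape matches."""
    for t_write, val in reversed(history):
        if t_write > t_at_or_before:
            continue  # written after this read's timestamp
        if shape is None or getattr(val, "shape", None) == shape:
            return val
    return None


if __name__ == "__main__":
    # One IPCQ slot, wrapped by a differently shaped write at t=2.0.
    history = [
        (1.0, Val((1, 64), "round0")),
        (2.0, Val((16, 64), "wrapped")),
        (3.0, Val((1, 64), "round1")),
    ]
    hit = resolve_from_history(history, (1, 64), 2.5)
    print(hit.tag if hit else None)
```

Matching on shape as well as time is what lets a read at t=2.5 skip the wrapped entry and recover the earlier value.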
@@ -96,13 +96,20 @@ class OpLogger:
        # gets reused on the next ring round).
        if self._memory_store is not None:
            if op_kind == "math":
+                handle_snaps = params.get("input_handle_data") or ()
                snaps: list[Any] = []
-                for addr, shape, space, idtype in zip(
+                for i, (addr, shape, space, idtype) in enumerate(zip(
                    params.get("input_addrs", []),
                    params.get("input_shapes", []),
                    params.get("input_spaces", []),
                    params.get("input_dtypes", []),
-                ):
+                )):
+                    if i < len(handle_snaps) and handle_snaps[i] is not None:
+                        carried = handle_snaps[i]
+                        snaps.append(
+                            carried.copy() if hasattr(carried, "copy") else carried
+                        )
+                        continue
                    try:
                        arr = self._memory_store.read(
                            space, addr, shape=shape, dtype=idtype,
@@ -111,6 +118,7 @@ class OpLogger:
                    except Exception:
                        snaps.append(None)
                params["input_snapshots"] = snaps
+                params.pop("input_handle_data", None)
            elif op_name == "dma_write":
                # ADR-0027 fix: only snapshot HBM sources. TCM (PE scratch)
                # sources are repopulated by Phase 2 math/gemm replay —
@@ -222,6 +230,7 @@ def _extract_op_info(msg: Any) -> tuple[str, str, dict[str, Any]]:
            "input_shapes": [h.shape for h in msg.inputs],
            "input_spaces": [getattr(h, "space", "tcm") for h in msg.inputs],
            "input_dtypes": [h.dtype for h in msg.inputs],
+            "input_handle_data": tuple(getattr(h, "data", None) for h in msg.inputs),
            "dst_addr": msg.out.addr,
            "dst_space": getattr(msg.out, "space", "tcm"),
            "shape_out": msg.out.shape,
@@ -0,0 +1,198 @@
+"""End-to-end engine drives for the four GQA Llama-70B panels (sub-cycle 4c step 2).
+
+Mirrors the existing single_user_decode diag harness across all four panels
+of the milestone-gqa-llama70b sweep (ADR-0057):
+
+  single_user_prefill   ring-K/V kernel, intracube PE ring  (8 PEs / 1 cube)
+  single_user_decode    allreduce-mlo kernel, intracube PE ring
+  multi_user_prefill    ring-K/V kernel, intercube multisip (4 cubes)
+  multi_user_decode     allreduce-mlo kernel, intercube multisip
+
+Each test runs the panel through ``run_bench`` with ``enable_data=True``
+and asserts ``result.completion.ok``. Failures dump the engine's op_log
+tail and the exception, mirroring the decode-diag harness format.
+
+Validation-scale config matches ADR-0057 D4:
+  S_q_prefill=16, S_kv_per_rank=16, h_q=h_kv=1, d_head=64
+  n_ranks_single_user=8, n_ranks_multi_user=4
+"""
+from __future__ import annotations
+
+import traceback
+from pathlib import Path
+
+import pytest
+
+from kernbench.benches._attention_mesh_kv import attention_mesh_kv_kernel
+from kernbench.benches._attention_mesh_mlo import attention_mesh_mlo_kernel
+from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
+from kernbench.ccl.sfr_config import (
+    configure_sfr_intercube_multisip,
+    configure_sfr_intracube_pe_ring,
+)
+from kernbench.policy.placement.dp import DPPolicy
+from kernbench.runtime_api.bench_runner import run_bench
+from kernbench.runtime_api.types import resolve_device
+from kernbench.sim_engine.engine import GraphEngine
+from kernbench.topology.builder import resolve_topology
+
+TOPOLOGY_PATH = Path(__file__).resolve().parents[2] / "topology.yaml"
+
+S_Q_PREFILL = 16
+S_Q_DECODE = 1
+S_KV_PER_RANK = 16
+H_Q = 1
+H_KV = 1
+D_HEAD = 64
+N_RANKS_SINGLE_USER = 8
+N_RANKS_MULTI_USER = 4
+DTYPE = "f16"
+
+
+# ── Helpers ──────────────────────────────────────────────────────
+
+
+def _engine_factory(t, d):
+    return GraphEngine(getattr(t, "topology_obj", t), enable_data=True)
+
+
+def _run_panel(bench_fn):
+    """Drive a panel through run_bench; return (exc, result, engine)."""
+    topo = resolve_topology(str(TOPOLOGY_PATH))
+    captured: dict = {"engine": None}
+
+    def factory(t, d):
+        eng = _engine_factory(t, d)
+        captured["engine"] = eng
+        return eng
+
+    exc = None
+    result = None
+    try:
+        result = run_bench(
+            topology=topo, bench_fn=bench_fn,
+            device=resolve_device(None), engine_factory=factory,
+        )
+    except BaseException as e:  # noqa: BLE001
+        exc = e
+    return exc, result, captured["engine"]
+
+
+def _assert_ok(name: str, exc, result, engine) -> None:
+    if exc is not None:
+        oplog_len = len(getattr(engine, "op_log", []) or []) if engine else 0
+        print(f"\n========== {name} FAIL ==========")
+        print(f"op_log records before crash: {oplog_len}")
+        print(f"{type(exc).__name__}: {exc}")
+        traceback.print_exception(type(exc), exc, exc.__traceback__)
+        raise AssertionError(
+            f"{name} failed at runtime: {exc}"
+        ) from exc
+    assert result is not None, f"{name}: no result"
+    assert result.completion.ok, f"{name}: completion not ok — {result.completion}"
+
+
+# ── Panel bench fns ──────────────────────────────────────────────
+
+
+def _bench_fn_single_user_prefill(ctx):
+    configure_sfr_intracube_pe_ring(
+        ctx.engine, ctx.spec,
+        resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
+    )
+    n = N_RANKS_SINGLE_USER
+    dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=n)
+    dp_kv = DPPolicy(cube="replicate", pe="row_wise", num_cubes=1, num_pes=n)
+    q = ctx.zeros((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
+    k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
+    v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
+    o = ctx.empty((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
+    ctx.launch(
+        "single_user_prefill_mesh", attention_mesh_kv_kernel,
+        q, k, v, o,
+        S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
+    )
+
+
+def _bench_fn_single_user_decode(ctx):
+    configure_sfr_intracube_pe_ring(
+        ctx.engine, ctx.spec,
+        resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
+    )
+    n = N_RANKS_SINGLE_USER
+    dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=n)
+    dp_kv = DPPolicy(cube="replicate", pe="row_wise", num_cubes=1, num_pes=n)
+    q = ctx.zeros((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
+    k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
+    v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
+    o = ctx.empty((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
+    ctx.launch(
+        "single_user_decode_mesh", attention_mesh_mlo_kernel,
+        q, k, v, o,
+        S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
+    )
+
+
+def _bench_fn_multi_user_prefill(ctx):
+    configure_sfr_intercube_multisip(
+        ctx.engine, ctx.spec,
+        resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
+    )
+    n = N_RANKS_MULTI_USER
+    dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=n, num_pes=8)
+    dp_kv = DPPolicy(cube="row_wise", pe="replicate", num_cubes=n, num_pes=8)
+    q = ctx.zeros((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
+    k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
+    v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
+    o = ctx.empty((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
+    ctx.launch(
+        "multi_user_prefill_mesh", attention_mesh_kv_kernel,
+        q, k, v, o,
+        S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
+        1,  # rank_axis=1 → ring at cube level (ADR-0059 multi_user)
+        _auto_dim_remap=False,
+    )
+
+
+def _bench_fn_multi_user_decode(ctx):
+    configure_sfr_intercube_multisip(
+        ctx.engine, ctx.spec,
+        resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
+    )
+    n = N_RANKS_MULTI_USER
+    dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=n, num_pes=8)
+    dp_kv = DPPolicy(cube="row_wise", pe="replicate", num_cubes=n, num_pes=8)
+    q = ctx.zeros((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
+    k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
+    v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
+    o = ctx.empty((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
+    ctx.launch(
+        "multi_user_decode_mesh", attention_mesh_mlo_kernel,
+        q, k, v, o,
+        S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
+        1,  # rank_axis=1 → ring at cube level (ADR-0059 multi_user)
+        _auto_dim_remap=False,
+    )
+
+
+# ── Tests ────────────────────────────────────────────────────────
+
+
+def test_single_user_prefill_through_engine():
+    exc, result, engine = _run_panel(_bench_fn_single_user_prefill)
+    _assert_ok("single_user_prefill", exc, result, engine)
+
+
+def test_single_user_decode_through_engine():
+    exc, result, engine = _run_panel(_bench_fn_single_user_decode)
+    _assert_ok("single_user_decode", exc, result, engine)
+
+
+def test_multi_user_prefill_through_engine():
+    exc, result, engine = _run_panel(_bench_fn_multi_user_prefill)
+    _assert_ok("multi_user_prefill", exc, result, engine)
+
+
+def test_multi_user_decode_through_engine():
+    exc, result, engine = _run_panel(_bench_fn_multi_user_decode)
+    _assert_ok("multi_user_decode", exc, result, engine)
@@ -0,0 +1,172 @@
+"""Phase 1 spec test for ``rank_axis`` parameter on the two mesh kernels.
+
+ADR-0059's mesh kernels currently hard-code ``rank = tl.program_id(axis=0)``,
+which only works for single_user_* panels (rank == pe_id within cube).
+For multi_user_* panels the ring is at the cube level — rank should be
+``cube_id`` (axis=1), and the 7 non-rank-leader PEs in each cube should
+not run the ring (they only hold KV replicas).
+
+This test pins the desired ``rank_axis`` kwarg semantics:
+
+  rank_axis = 0 (default, single_user)
+    rank = tl.program_id(axis=0). Every PE in the cube runs the ring.
+    Existing behavior — no change.
+
+  rank_axis = 1 (multi_user)
+    if tl.program_id(axis=0) != 0: return.  (7/8 PEs early-exit.)
+    rank = tl.program_id(axis=1).
+
+Phase 1 expectation: tests fail today (kernels don't accept the kwarg).
+Phase 2 lands the parameter on both kernels; tests turn green and the
+multi_user_* diag harness clears its first send.
+"""
+from __future__ import annotations
+
+from kernbench.common.ipcq_types import IpcqRecvCmd, IpcqSendCmd
+from kernbench.common.pe_commands import GemmCmd
+from kernbench.triton_emu.tl_context import TLContext, run_kernel
+
+from kernbench.benches._attention_mesh_kv import attention_mesh_kv_kernel
+from kernbench.benches._attention_mesh_mlo import attention_mesh_mlo_kernel
+
+S_Q_PREFILL = 16
+S_Q_DECODE = 1
+S_KV_PER_RANK = 16
+H_Q = 1
+H_KV = 1
+D_HEAD = 64
+N_RANKS_MULTI = 4
+PES_PER_CUBE = 8
+
+Q_PTR = 0x10000
+K_PTR = 0x20000
+V_PTR = 0x30000
+O_PTR = 0x40000
+
+
+def _tl(pe_id: int, cube_id: int, num_pes: int, num_cubes: int) -> TLContext:
+    return TLContext(
+        pe_id=pe_id,
+        num_programs=num_pes,
+        cube_id=cube_id,
+        num_cubes=num_cubes,
+        dispatch_cycles=0,
+        scratch_base=0x80000,
+        scratch_size=1 << 20,
+    )
+
+
+# ── Default rank_axis=0 backward-compat ──────────────────────────
+
+
+def test_mlo_kernel_default_rank_axis_zero_emits_commands_on_all_pes():
+    """rank_axis defaults to 0 → kernel uses pe_id as rank, runs on every
+    PE. Verify by running rank=3 (interior PE) in a single-cube 8-rank
+    setup and asserting at least one GEMM and at least one IPCQ send
+    are emitted (interior ranks send in both directions)."""
+    tl = _tl(pe_id=3, cube_id=0, num_pes=8, num_cubes=1)
+    run_kernel(
+        attention_mesh_mlo_kernel, tl,
+        Q_PTR, K_PTR, V_PTR, O_PTR,
+        S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, 8,
+    )
+    assert any(isinstance(c, GemmCmd) for c in tl.commands), \
+        "default rank_axis=0 must run the kernel (≥1 GEMM)"
+    assert any(isinstance(c, IpcqSendCmd) for c in tl.commands), \
+        "interior rank must emit ≥1 IpcqSendCmd"
+
+
+def test_kv_kernel_default_rank_axis_zero_emits_commands_on_all_pes():
+    tl = _tl(pe_id=3, cube_id=0, num_pes=8, num_cubes=1)
+    run_kernel(
+        attention_mesh_kv_kernel, tl,
+        Q_PTR, K_PTR, V_PTR, O_PTR,
+        S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, 8,
+    )
+    assert any(isinstance(c, GemmCmd) for c in tl.commands)
+    assert any(isinstance(c, IpcqSendCmd) for c in tl.commands)
+
+
+# ── rank_axis=1 multi_user semantics ─────────────────────────────
+
+
+def test_mlo_kernel_rank_axis_one_gates_non_zero_pe_to_no_commands():
+    """rank_axis=1 + pe_id != 0 → kernel must early-return; no GEMM,
+    no DMA, no IPCQ. The 7 non-rank-leader PEs in a multi_user cube
+    must stay completely silent so the cube-level SFR install isn't
+    asked to route sends from PEs that have no neighbors installed."""
+    tl = _tl(pe_id=2, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
+    run_kernel(
+        attention_mesh_mlo_kernel, tl,
+        Q_PTR, K_PTR, V_PTR, O_PTR,
+        S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
+        rank_axis=1,
+    )
+    assert not any(isinstance(c, GemmCmd) for c in tl.commands), \
+        "pe_id=2 with rank_axis=1 must not emit GEMMs"
+    assert not any(isinstance(c, IpcqSendCmd) for c in tl.commands), \
+        "pe_id=2 with rank_axis=1 must not emit IpcqSendCmd"
+    assert not any(isinstance(c, IpcqRecvCmd) for c in tl.commands), \
+        "pe_id=2 with rank_axis=1 must not emit IpcqRecvCmd"
+
+
+def test_kv_kernel_rank_axis_one_gates_non_zero_pe_to_no_commands():
+    tl = _tl(pe_id=2, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
+    run_kernel(
+        attention_mesh_kv_kernel, tl,
+        Q_PTR, K_PTR, V_PTR, O_PTR,
+        S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
+        rank_axis=1,
+    )
+    assert not any(isinstance(c, GemmCmd) for c in tl.commands)
+    assert not any(isinstance(c, IpcqSendCmd) for c in tl.commands)
+    assert not any(isinstance(c, IpcqRecvCmd) for c in tl.commands)
+
+
+def test_mlo_kernel_rank_axis_one_pe_zero_uses_cube_id_as_rank():
+    """rank_axis=1 + pe_id == 0 → kernel runs the ring with rank=cube_id.
+    For cube_id=1 in a 4-cube ring, rank=1 is an interior rank: has_E=True
+    AND has_W=True → IPCQ sends emitted in both E and W directions.
+    """
+    tl = _tl(pe_id=0, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
+    run_kernel(
+        attention_mesh_mlo_kernel, tl,
+        Q_PTR, K_PTR, V_PTR, O_PTR,
+        S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
+        rank_axis=1,
+    )
+    sends = [c for c in tl.commands if isinstance(c, IpcqSendCmd)]
+    assert any(s.direction == "E" for s in sends), \
+        "cube_id=1 (interior) must emit ≥1 E-send"
+    assert any(s.direction == "W" for s in sends), \
+        "cube_id=1 (interior) must emit ≥1 W-send"
+
+
+def test_kv_kernel_rank_axis_one_pe_zero_uses_cube_id_as_rank():
+    tl = _tl(pe_id=0, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
+    run_kernel(
+        attention_mesh_kv_kernel, tl,
+        Q_PTR, K_PTR, V_PTR, O_PTR,
+        S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
+        rank_axis=1,
+    )
+    sends = [c for c in tl.commands if isinstance(c, IpcqSendCmd)]
+    assert any(s.direction == "E" for s in sends)
+    assert any(s.direction == "W" for s in sends)
+
+
+def test_mlo_kernel_rank_axis_one_west_edge_cube_no_west_sends():
+    """cube_id=0 (west edge) with rank_axis=1: rank=0, has_W=False → no
+    W-direction IPCQ sends. has_E=True → ≥1 E-direction send."""
+    tl = _tl(pe_id=0, cube_id=0, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
+    run_kernel(
+        attention_mesh_mlo_kernel, tl,
+        Q_PTR, K_PTR, V_PTR, O_PTR,
+        S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
+        rank_axis=1,
+    )
+    sends = [c for c in tl.commands if isinstance(c, IpcqSendCmd)]
+    assert any(s.direction == "E" for s in sends), \
+        "west-edge cube_id=0 must still emit ≥1 E-send"
+    assert not any(s.direction == "W" for s in sends), \
+        "west-edge cube_id=0 must NOT emit any W-send (no W neighbor)"
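The semantics these tests pin down can be summarized in a few lines of plain Python. This is a hedged sketch, not the kernel source: `resolve_rank` and `ring_neighbors` are hypothetical helper names, and the east-edge rule (`has_E` is false for the last rank) is an assumption by symmetry with the west-edge test above.

```python
def resolve_rank(pe_id, cube_id, rank_axis=0):
    """Return the ring rank for this program, or None if it must stay silent."""
    if rank_axis == 0:
        # single_user: every PE in the cube is a ring member.
        return pe_id
    if rank_axis == 1:
        # multi_user: only PE 0 of each cube joins the cube-level ring.
        if pe_id != 0:
            return None
        return cube_id
    raise ValueError(f"unsupported rank_axis: {rank_axis}")


def ring_neighbors(rank, n_ranks):
    """(has_E, has_W) for an open chain of n_ranks ranks (assumed layout)."""
    return (rank < n_ranks - 1, rank > 0)


if __name__ == "__main__":
    # PE 2 of cube 1 is gated out; PE 0 of cube 1 becomes interior rank 1.
    print(resolve_rank(pe_id=2, cube_id=1, rank_axis=1))
    print(resolve_rank(pe_id=0, cube_id=1, rank_axis=1))
    print(ring_neighbors(0, 4))
```

Returning `None` for gated PEs mirrors the early `return` in the kernels, so no GEMM, DMA, or IPCQ command is ever emitted for them.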
@@ -1,283 +1,31 @@
-"""Shared plotting plumbing for the GEMM figure tests.
+"""Thin re-export shim for the GEMM figure tests.

-Not a test module (no ``test_`` prefix -> pytest does not collect it).
+Not a test module (no ``test_`` prefix → pytest does not collect it).

-Reads the committed ``docs/diagrams/gemm_sweep.json`` (produced by the heavy
-``scripts/gemm_sweep.py`` sim sweep) and renders matplotlib PNGs into
-``docs/diagrams/gemm_plots/``. No simulation here -> the figure tests are fast
-and run by default; regenerating the underlying data stays a manual script.
-
-Chart set (mirrors the GEMM MAC slides in scripts/build_overview_slides.py):
-  - stage breakdown (load_ref operand staging)
-  - MAC utilization — measured (load_ref)
-  - MAC utilization — theoretical vs measured (load_ref)
+The sweep + renderer logic now lives in
+``kernbench.benches.milestone_1h_gemm`` (production single home, ADR-0054,
+also driven by ``scripts/gemm_sweep.py``). The figure tests import the same
+names from here; behavior is unchanged (defaults still target
+``docs/diagrams/gemm_plots/``).
 """
 from __future__ import annotations

-import json
-from pathlib import Path
+from kernbench.benches.milestone_1h_gemm import (
+    DEFAULT_PLOTS_DIR as GEMM_PLOTS_DIR,
+    DEFAULT_SWEEP_JSON as GEMM_SWEEP_JSON,
+    ROOT,
+    emit_all_gemm_plots,
+    emit_mac_utilization_measured,
+    emit_mac_utilization_theoretical_vs_measured,
+    emit_stage_breakdown,
+)

-ROOT = Path(__file__).resolve().parent.parent.parent
-GEMM_SWEEP_JSON = ROOT / "docs" / "diagrams" / "gemm_sweep.json"
-GEMM_PLOTS_DIR = ROOT / "docs" / "diagrams" / "gemm_plots"
-
-# Shapes excluded from the figures (mirrors build_overview_slides).
-EXCLUDED_SHAPES = {(512, 512, 512)}
-
-# Stage bars shown (raw op_log stage_type keys) + display names + colors.
-STAGE_KEYS = ["DMA_READ", "FETCH", "GEMM", "DMA_WRITE"]
-STAGE_DISPLAY = {
-    "DMA_READ":  "DMA in",
-    "FETCH":     "Fetch",
-    "GEMM":      "GEMM",
-    "DMA_WRITE": "DMA out",
-}
-STAGE_COLORS = {
-    "DMA_READ":  "#3B82F6",
-    "FETCH":     "#10B981",
-    "GEMM":      "#F59E0B",
-    "DMA_WRITE": "#A855F7",
-}
-
-# MAC-utilization model constants (mirror build_overview_slides).
-_HBM_GBS = 256.0
-_BPE = 2
-_T_STAGE = 16.0
-_D_STAGES = 3
-
-_PLOT_VARIANT = "load_ref"
-
-
-def _load_sweep_data() -> dict:
-    if not GEMM_SWEEP_JSON.exists():
-        return {"rows": []}
-    data = json.loads(GEMM_SWEEP_JSON.read_text())
-    data["rows"] = [
-        r for r in data.get("rows", [])
-        if (r["M"], r["K"], r["N"]) not in EXCLUDED_SHAPES
-    ]
-    return data
-
-
-def _shape_label(r: dict) -> str:
-    if r["M"] == r["K"] == r["N"]:
-        return f"M=K=N={r['M']}"
-    return f"M={r['M']} K={r['K']} N={r['N']}"
-
-
-def _under_tile(M, K, N, tile_M, tile_K, tile_N) -> bool:
-    return M < tile_M or K < tile_K or N < tile_N
-
-
-def _xtick_labels(shape_labels, tile_counts, flagged) -> list[str]:
-    out = []
-    for lbl, tc, fl in zip(shape_labels, tile_counts, flagged):
-        s = f"{lbl}\n({tc} tiles)"
-        if fl:
-            s += " *"
-        out.append(s)
-    return out
-
-
-def _grouped_bar_png(
-    out_name: str, *, title: str, subtitle: str | None,
-    shape_labels, tile_counts, flagged, series: dict, colors: dict,
-    y_label: str, threshold: float | None = None, footnote: str | None = None,
-) -> str:
-    """Render one grouped-bar chart to GEMM_PLOTS_DIR/out_name; return the path."""
-    import matplotlib.pyplot as plt
-    import numpy as np
-
-    n_groups = len(shape_labels)
-    n_series = max(1, len(series))
-    x = np.arange(n_groups)
-    width = 0.8 / n_series
-
-    fig, ax = plt.subplots(figsize=(11, 6))
-    for i, (name, vals) in enumerate(series.items()):
-        offset = (i - (n_series - 1) / 2) * width
-        ax.bar(x + offset, vals, width, label=name, color=colors.get(name))
-
-    ax.set_xticks(x)
-    ax.set_xticklabels(
-        _xtick_labels(shape_labels, tile_counts, flagged), fontsize=8,
-    )
-    ax.set_ylabel(y_label)
-    ax.set_title(title, fontsize=13, fontweight="bold")
-    if subtitle:
-        ax.text(0.5, 1.01, subtitle, transform=ax.transAxes, ha="center",
-                va="bottom", fontsize=8, color="#475569")
-    if threshold is not None:
-        ax.axhline(threshold, ls="--", color="gray", lw=1.0)
-    ax.legend(fontsize=8, loc="upper right")
-    ax.grid(True, axis="y", alpha=0.3)
-
-    caption = "* = under-tile shape (M<TILE_M, K<TILE_K, or N<TILE_N)"
-    if footnote:
-        caption = footnote + "\n" + caption
-    fig.text(0.5, 0.01, caption, ha="center", fontsize=7, color="gray",
-             wrap=True)
-
-    fig.tight_layout(rect=(0, 0.05, 1, 1))
-    GEMM_PLOTS_DIR.mkdir(parents=True, exist_ok=True)
-    out = GEMM_PLOTS_DIR / out_name
-    fig.savefig(out, dpi=120)
-    plt.close(fig)
-    return str(out)
-
-
-# ── individual chart renderers (read sweep JSON, emit one PNG each) ─────
-
-
-def emit_stage_breakdown() -> str | None:
-    """Per-stage engine wall-clock per shape (load_ref operand staging)."""
-    data = _load_sweep_data()
-    rows = [r for r in data["rows"] if r.get("variant") == _PLOT_VARIANT]
-    if not rows:
-        return None
-    tile = data["tile_sizes"]
-    shape_labels = [_shape_label(r) for r in rows]
-    flagged = [_under_tile(r["M"], r["K"], r["N"], tile["M"], tile["K"], tile["N"])
-               for r in rows]
-    tile_counts = [r["tile_count_expected"] for r in rows]
-    series = {
-        STAGE_DISPLAY[s]: [r.get("stages", {}).get(s, {}).get("wall_ns", 0.0)
-                           for r in rows]
-        for s in STAGE_KEYS
-    }
-    colors = {STAGE_DISPLAY[s]: STAGE_COLORS[s] for s in STAGE_KEYS}
-    return _grouped_bar_png(
-        "gemm_stage_breakdown.png",
-        title="GEMM stage breakdown",
-        subtitle=(f"Per-stage engine wall-clock (DMA in / Fetch / GEMM / "
-                  f"DMA out), {_PLOT_VARIANT} staging. "
-                  f"Tile {tile['M']}x{tile['K']}x{tile['N']}."),
-        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
-        series=series, colors=colors, y_label="ns",
-        footnote="Bars = engine wall-clock interval (merged overlaps).",
-    )
-
-
-def emit_mac_utilization_measured() -> str | None:
-    """GEMM util % and useful pipeline-eff % (analytical model, load_ref)."""
-    data = _load_sweep_data()
-    rows = data["rows"]
-    if not rows:
-        return None
-    tile = data["tile_sizes"]
-    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
-    tile_flops = 2 * TILE_M * TILE_K * TILE_N
-    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
-    head_ns = (_D_STAGES - 1) * _T_STAGE
-
-    by_shape = {(r["M"], r["K"], r["N"]): r
-                for r in rows if r["variant"] == _PLOT_VARIANT}
-    shapes = list(by_shape)
-    if not shapes:
-        return None
-    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
-    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
-    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
-
-    gemm_util, useful_eff = [], []
-    for k in shapes:
-        r = by_shape[k]
-        M, K, N = r["M"], r["K"], r["N"]
-        useful = 2 * M * K * N
-        tiles = r["tile_count_expected"]
-        gu = useful / (tile_flops * tiles) * 100
-        gemm_util.append(gu)
-        m_tiles = (M + TILE_M - 1) // TILE_M
-        n_tiles = (N + TILE_N - 1) // TILE_N
-        n_mn = m_tiles * n_tiles
-        compute_total = tiles * _T_STAGE
-        wall = head_ns + tiles * _T_STAGE + max(0, n_mn - 1) * dma_w_per_pair
-        ueff = (compute_total * (gu / 100.0) / wall) * 100 if wall > 0 else 0.0
-        useful_eff.append(ueff)
-
-    series = {"GEMM util %": gemm_util, "Useful eff %": useful_eff}
-    colors = {"GEMM util %": "#10B981", "Useful eff %": "#F59E0B"}
-    return _grouped_bar_png(
-        "gemm_mac_utilization_measured.png",
-        title="GEMM MAC utilization — load_ref",
-        subtitle=("GEMM util = useful FLOPs / (tile FLOPs x tiles); "
-                  "Useful eff = GEMM util x ideal pipeline efficiency."),
-        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
-        series=series, colors=colors, y_label="%", threshold=100.0,
-        footnote="Theoretical ideal-pipeline model (not simulator data).",
-    )
-
-
-def emit_mac_utilization_theoretical_vs_measured() -> str | None:
-    """Theoretical vs simulator-measured GEMM util / useful eff (load_ref)."""
-    data = _load_sweep_data()
-    rows = data["rows"]
-    if not rows:
-        return None
-    tile = data["tile_sizes"]
-    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
-    tile_flops = 2 * TILE_M * TILE_K * TILE_N
-    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
-    head_ns = (_D_STAGES - 1) * _T_STAGE
-    peak_per_ns = tile_flops / _T_STAGE
-
-    by_shape = {(r["M"], r["K"], r["N"]): r
-                for r in rows if r["variant"] == _PLOT_VARIANT}
-    shapes = list(by_shape)
-    if not shapes:
-        return None
-    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
-    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
-    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
-
-    gu_t, gu_m, eff_t, eff_m = [], [], [], []
-    for k in shapes:
-        r = by_shape[k]
-        M, K, N = r["M"], r["K"], r["N"]
-        useful = 2 * M * K * N
-        tiles = r["tile_count_expected"]
-        gut = useful / (tile_flops * tiles)
-        gu_t.append(gut * 100)
-        rec = r.get("stages", {}).get("GEMM", {}).get("record_count", 0) or tiles
-        gu_m.append((useful / (tile_flops * rec) * 100) if rec else 0.0)
-        m_tiles = (M + TILE_M - 1) // TILE_M
-        n_tiles = (N + TILE_N - 1) // TILE_N
-        n_mn = m_tiles * n_tiles
-        compute_total = tiles * _T_STAGE
-        wall_t = head_ns + compute_total + max(0, n_mn - 1) * dma_w_per_pair
-        eff_t.append((compute_total * gut / wall_t * 100) if wall_t > 0 else 0.0)
-        cw = r.get("composite_window_ns", 0.0) or 0.0
-        eff_m.append((useful / cw / peak_per_ns * 100) if cw > 0 else 0.0)
-
-    series = {
-        "GEMM util % (theoretical)": gu_t,
-        "GEMM util % (measured)":    gu_m,
-        "Theoretical eff %":         eff_t,
-        "Measured eff %":            eff_m,
-    }
-    colors = {
-        "GEMM util % (theoretical)": "#10B981",
-        "GEMM util % (measured)":    "#6EE7B7",
-        "Theoretical eff %":         "#F59E0B",
-        "Measured eff %":            "#3B82F6",
-    }
-    return _grouped_bar_png(
-        "gemm_mac_utilization_theoretical_vs_measured.png",
-        title="GEMM MAC utilization — theoretical vs measured (load_ref)",
-        subtitle=("theoretical model vs simulator op_log; agreement "
-                  "validates the analytical pipeline model."),
-        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
-        series=series, colors=colors, y_label="%", threshold=100.0,
-    )
-
-
-def emit_all_gemm_plots() -> list[str]:
-    """Render every GEMM figure that has data; return the list of paths written."""
-    paths = []
-    for fn in (emit_stage_breakdown,
-               emit_mac_utilization_measured,
-               emit_mac_utilization_theoretical_vs_measured):
-        p = fn()
-        if p:
-            paths.append(p)
-    return paths
+__all__ = [
+    "GEMM_PLOTS_DIR",
+    "GEMM_SWEEP_JSON",
+    "ROOT",
+    "emit_all_gemm_plots",
+    "emit_mac_utilization_measured",
+    "emit_mac_utilization_theoretical_vs_measured",
+    "emit_stage_breakdown",
+]
@@ -0,0 +1,131 @@
+"""Phase 1 spec test for ``ctx.launch`` dim-translation bugs surfaced by
+the multi_user_* panels of milestone-gqa-llama70b (sub-cycle 4c step 2).
+
+The default ``topology.yaml`` has 4×4 = 16 cubes per SIP, so
+``RuntimeContext._num_cubes == 16``. Multi-user attention panels run a
+4-cube ring (validation scale) by passing ``DPPolicy(num_cubes=4)``.
+
+Two bugs in ``ctx.launch`` make this combination silently produce wrong
+kernel arguments:
+
+Bug A — _compute_local_shape ignores DPPolicy.num_cubes
+    ``_compute_local_shape`` in ``ctx.launch`` divides by
+    ``self._num_cubes`` (the topology's cube count, 16) instead of the
+    DPPolicy's effective ``num_cubes`` (4). So a ``(M=80, K=64)`` tensor
+    sharded ``cube="row_wise"`` with ``DPPolicy(num_cubes=4)`` produces
+    a local M of ``80 // 16 = 5``, not the kernel-expected ``80 // 4 = 20``.
+    Note: tensor allocation already honors ``dp.num_cubes`` correctly at
+    [context.py:471-484](src/kernbench/runtime_api/context.py#L471-L484);
+    the bug is that the parallel computation inside ``launch`` is out of sync.
+
+Bug B — scalar args coincidentally equal to a global tensor dim get auto-remapped
+    The dim_map at [context.py:712-770](src/kernbench/runtime_api/context.py#L712-L770)
+    is keyed by *value*, so any scalar whose value coincides with a
+    global tensor dim gets rewritten to that dim's local value — even
+    when the scalar is unrelated. ``d_head=64`` coincides with the
+    multi_user K's global M = ``S_kv_per_rank * n = 16 * 4 = 64``, so
+    the kernel receives ``d_head = 16`` (the post-Bug-A local) or
+    ``d_head = 4`` (the pre-Bug-A local) instead of ``64``.
+
+    Legacy bench kernels rely on auto-remap (e.g. ``test_va_offset.py``
+    passes global N and expects the kernel to see local N). The fix is
+    opt-out, not removal: ``ctx.launch(..., _auto_dim_remap=False)``
+    preserves scalars exactly as passed, default behavior unchanged.
+
+Both bug tests (A and B) fail today. Phase 2 fixes them in [src/kernbench/runtime_api/context.py](src/kernbench/runtime_api/context.py).
+"""
+from __future__ import annotations
+
+from pathlib import Path
+
+from kernbench.policy.placement.dp import DPPolicy
+from kernbench.runtime_api.context import RuntimeContext
+from kernbench.runtime_api.types import DeviceSelector
+from kernbench.sim_engine.engine import GraphEngine
+from kernbench.topology.builder import load_topology
+
+TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
+
+
+def _make_ctx(corr_id: str) -> RuntimeContext:
+    graph = load_topology(TOPOLOGY_PATH)
+    engine = GraphEngine(graph)
+    return RuntimeContext(
+        engine=engine, target_device=DeviceSelector("sip:0"),
+        correlation_id=corr_id, spec=graph.spec,
+    )
+
+
+def test_topology_num_cubes_is_16_baseline_assumption():
+    """Sanity: confirm the topology this test assumes (16 cubes per SIP).
+    If this fails, recheck the topology.yaml cube_mesh setting before
+    interpreting the other failures below. ``_num_cubes`` is initialized
+    lazily by ``_ensure_allocators`` on first tensor op, so trigger it."""
+    ctx = _make_ctx("dim-baseline")
+    ctx._ensure_allocators()
+    assert ctx._num_cubes == 16, (
+        f"expected default topology.yaml to give 16 cubes per SIP, "
+        f"got {ctx._num_cubes}"
+    )
+
+
+def test_ctx_launch_local_shape_honors_dppolicy_num_cubes():
+    """Bug A. ``DPPolicy(num_cubes=4)`` must be the divisor for
+    row_wise sharding inside ctx.launch's dim_map, not the topology's 16.
+
+    Setup: K-like tensor with M_global = 80 (cleanly divisible by both
+    4 and 16, distinct local values 20 vs 5). Pass M_global as a kernel
+    scalar; the kernel records what it received. With correct dim_map,
+    scalar 80 is remapped to 20 (80 / dp.num_cubes). With current code,
+    it is remapped to 5 (80 / self._num_cubes, which is 16).
+    """
+    captured: dict[str, int] = {}
+
+    def _kernel(t, m_scalar, *, tl):  # noqa: ARG001
+        captured["m_scalar"] = int(m_scalar)
+
+    ctx = _make_ctx("dim-bugA")
+    dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=4, num_pes=8)
+    t = ctx.zeros((80, 64), dtype="f16", dp=dp, name="t80x64")
+    ctx.launch("bugA_capture", _kernel, t, 80)
+    ctx.wait_all()
+
+    assert "m_scalar" in captured, "kernel was not invoked"
+    assert captured["m_scalar"] == 20, (
+        f"expected dim_map to divide 80 by dp.num_cubes=4 → 20; "
+        f"got {captured['m_scalar']} (likely divided by topology cubes=16)"
+    )
+
+
+def test_ctx_launch_scalar_passed_through_when_auto_remap_disabled():
+    """Bug B. Scalars must not be silently remapped when their value
+    happens to equal a tensor's global dim — at minimum the caller must
+    have an opt-out.
+
+    Setup: K-like tensor with M_global = 64 row_wise. Pass d_head = 64
+    as a scalar (semantically unrelated to K's M, but coincidentally
+    equal). The kernel records d_head. With ``_auto_dim_remap=False``
+    on ctx.launch, d_head must stay 64.
+
+    Today: ``_auto_dim_remap`` kwarg doesn't exist → TypeError. After
+    Phase 2: kwarg exists, defaults to True (legacy unchanged); passing
+    False preserves the scalar.
+    """
+    captured: dict[str, int] = {}
+
+    def _kernel(t, d_head, *, tl):  # noqa: ARG001
+        captured["d_head"] = int(d_head)
+
+    ctx = _make_ctx("dim-bugB")
+    dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=4, num_pes=8)
+    t = ctx.zeros((64, 64), dtype="f16", dp=dp, name="t64x64")
+    ctx.launch(
+        "bugB_capture", _kernel, t, 64,
+        _auto_dim_remap=False,
+    )
+    ctx.wait_all()
+
+    assert captured.get("d_head") == 64, (
+        f"expected d_head scalar to pass through unchanged when "
+        f"_auto_dim_remap=False; got {captured.get('d_head')!r}"
+    )
@@ -0,0 +1,77 @@
+"""Milestone benches: registration + figure/result generation (ADR-0054).
+
+``milestone-1h-gemm`` / ``milestone-1h-ccl`` are eval benches: run via the
+normal ``run_bench`` path, they regenerate every GEMM / allreduce figure +
+CSV into ``benches/1H_milestone_output/{gemm,ccl}/``. The GEMM bench in
+``MILESTONE_FAST=1`` mode just re-renders the committed sweep JSON (fast,
+default-run here); the CCL bench drives both full sweeps (slow, opt-in).
+"""
+from __future__ import annotations
+
+import re
+from pathlib import Path
+
+import pytest
+
+from kernbench.benches.registry import resolve
+from kernbench.runtime_api.bench_runner import run_bench
+from kernbench.runtime_api.types import resolve_device
+from kernbench.sim_engine.engine import GraphEngine
+from kernbench.topology.builder import resolve_topology
+
+import kernbench.benches.milestone_1h_ccl as ccl_bench
+import kernbench.benches.milestone_1h_gemm as gemm_bench
+
+_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")
+
+
+def _run(name: str):
+    topo = resolve_topology("topology.yaml")
+    return run_bench(
+        topology=topo, bench_fn=resolve(name).run, device=resolve_device(None),
+        engine_factory=lambda t, d: GraphEngine(
+            getattr(t, "topology_obj", t), enable_data=True,
+        ),
+    )
+
+
+def test_milestone_benches_registered():
+    for name in ("milestone-1h-gemm", "milestone-1h-ccl"):
+        spec = resolve(name)
+        assert spec.name == name
+        assert _NAME_RE.match(spec.name)
+        assert spec.description.strip()
+
+
+@pytest.mark.skipif(
+    not gemm_bench.DEFAULT_SWEEP_JSON.exists(),
+    reason="gemm_sweep.json absent; run scripts/gemm_sweep.py first",
+)
+def test_milestone_gemm_fast_generates_figures(monkeypatch):
+    monkeypatch.setenv("MILESTONE_FAST", "1")
+    result = _run("milestone-1h-gemm")
+    assert result.completion.ok, result.completion
+    out = gemm_bench._OUTPUT_DIR
+    for png in (
+        "gemm_stage_breakdown.png",
+        "gemm_mac_utilization_measured.png",
+        "gemm_mac_utilization_theoretical_vs_measured.png",
+    ):
+        assert (out / png).exists(), f"missing {png}"
+
+
+@pytest.mark.slow
+def test_milestone_ccl_generates_figures():
+    result = _run("milestone-1h-ccl")
+    assert result.completion.ok, result.completion
+    out = ccl_bench._OUTPUT_DIR
+    for artifact in (
+        "summary.csv",
+        "topology.png",
+        "comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png",
+        "AllReduce_LRAB_2Dtorus_6SiP_2x3_with_TCM_SRAM_HBM.png",
+        "AllReduce_LRAB_Ring1D_6SiP_1x6.png",
+        "AllReduce_LRAB_2Dtorus_6SiP_2x3.png",
+        "AllReduce_LRAB_2DMesh_6SiP_2x3.png",
+    ):
+        assert (out / artifact).exists(), f"missing {artifact}"
@@ -0,0 +1,218 @@
+"""Phase 1 spec test for the math-input snapshot race (IPCQ slot wrap).
+
+Context (sub-cycle 4c.0 diagnostic):
+
+  The mesh decode kernel (_attention_mesh_mlo.py) issues many tl.recv()
+  calls against an IPCQ ring of ~8 slots. With n_ranks=8 and bidirectional
+  fan-out, each PE issues 3 recvs per step × 7 steps × 2 directions =
+  42 recvs per panel. The IPCQ slot index is ``my_tail % n_slots``, so
+  the ring wraps and a fresh recv overwrites a slot whose data a prior
+  math op has not yet snapshotted.
+
+  OpLogger.record_end currently snapshots math inputs by re-reading
+  MemoryStore at record_end time (op_log.py:97-113). When a later recv
+  has overwritten the input addr with a DIFFERENT-shape array between
+  record_start and record_end, MemoryStore.read raises
+  ``Shape mismatch: stored (16, 64) vs requested (16, 1)`` and the
+  snapshot becomes None (or, in Phase 2 replay, surfaces the same
+  exception in DataExecutor).
+
+Phase 1 expectation: the slot-overwrite tests below currently fail. They
+assert the *desired* behavior: when the math input TensorHandle carries
+a .data snapshot (captured at recv time before the slot was wrapped),
+OpLogger MUST prefer that snapshot over MemoryStore.read.
+
+After Phase 2 (snapshot propagation fix), these tests pass, and the
+sub-cycle 4c.0 mesh decode end-to-end (test_attention_mesh_decode_diag
+and test_milestone_gqa_llama70b) passes for the same reason.
+
+See: docs/adr/ADR-0020 (two-phase execution),
+     docs/adr/ADR-0023 (IPCQ ring slots),
+     docs/adr/ADR-0027 (snapshot discipline for dma_write).
+"""
+from __future__ import annotations
+
+import numpy as np
+
+from kernbench.common.pe_commands import MathCmd, TensorHandle
+from kernbench.sim_engine.memory_store import MemoryStore
+from kernbench.sim_engine.op_log import OpLogger
+
+
+# ── Helpers ──────────────────────────────────────────────────────
+
+
+def _slot_handle(addr: int, shape: tuple[int, ...], dtype: str,
+                 data: np.ndarray | None) -> TensorHandle:
+    """Build a TensorHandle as tl.recv() would: addr=slot, .data=snapshot."""
+    nbytes = int(np.prod(shape)) * np.dtype(
+        {"f16": np.float16, "f32": np.float32}[dtype]
+    ).itemsize
+    return TensorHandle(
+        id=f"slot_{addr:x}", addr=addr, shape=shape, dtype=dtype,
+        nbytes=nbytes, data=data, space="tcm",
+    )
+
+
+def _out_handle(addr: int, shape: tuple[int, ...], dtype: str) -> TensorHandle:
+    nbytes = int(np.prod(shape)) * np.dtype(
+        {"f16": np.float16, "f32": np.float32}[dtype]
+    ).itemsize
+    return TensorHandle(
+        id=f"out_{addr:x}", addr=addr, shape=shape, dtype=dtype,
+        nbytes=nbytes, data=None, space="tcm",
+    )
+
+
+# ── Tests ─────────────────────────────────────────────────────────
+
+
+def test_math_snapshot_lost_when_input_slot_overwritten_with_same_nbytes():
+    """Quiet variant (fails today): if a later write at the input addr has the
+    SAME nbytes as the math input's expected shape, MemoryStore.read
+    returns the LATER data — the snapshot is silently wrong. This is the
+    quiet variant of the bug; it does not raise, it just produces
+    incorrect numerical output in Phase 2.
+
+    This test documents that the current OpLogger behavior is wrong even
+    when shapes coincidentally match. The Phase 2 fix removes this
+    silent-corruption mode by preferring handle.data.
+    """
+    store = MemoryStore()
+    slot_addr = 0x3000
+    # Original at recv time: filled with 7s.
+    original = np.full((16, 1), 7.0, dtype=np.float16)
+    store.write("tcm", slot_addr, original)
+
+    inp = _slot_handle(slot_addr, (16, 1), "f16", data=original.copy())
+    out = _out_handle(0x4000, (16, 1), "f16")
+    cmd = MathCmd(op="maximum", inputs=(inp,), out=out)
+
+    logger = OpLogger(memory_store=store)
+    logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)
+
+    # SIMULATE: a later recv writes a DIFFERENT array at the same slot
+    # (same nbytes as (16,1), so MemoryStore.read does not raise).
+    later = np.full((16, 1), 99.0, dtype=np.float16)
+    store.write("tcm", slot_addr, later)
+
+    logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)
+
+    snap = logger.records[0].params["input_snapshots"][0]
+    assert snap is not None
+    # Desired post-fix behavior: snapshot equals ``original``.
+    # Today: snapshot equals ``later`` — silent corruption.
+    np.testing.assert_array_equal(snap, original)
+
+
+def test_math_snapshot_survives_input_slot_wrap_with_different_shape():
+    """The hard-failure variant: a later recv overwrites the input slot
+    with a DIFFERENT-shape array (different nbytes), so MemoryStore.read
+    at record_end raises and the snapshot becomes None. Phase 2 replay
+    then surfaces this as the (16, 64) vs (16, 1) crash seen in
+    test_attention_mesh_decode_diag.
+
+    Desired behavior: handle.data carries the recv-time snapshot, so
+    OpLogger never has to look at MemoryStore for this input → no race,
+    snapshot is correct.
+    """
+    store = MemoryStore()
+    slot_addr = 0x3000
+
+    # Original at recv time: an (m, ℓ) reduction result, shape (16, 1).
+    original = np.full((16, 1), 7.0, dtype=np.float16)
+    store.write("tcm", slot_addr, original)
+
+    inp = _slot_handle(slot_addr, (16, 1), "f16", data=original.copy())
+    out = _out_handle(0x4000, (16, 1), "f16")
+    cmd = MathCmd(op="maximum", inputs=(inp,), out=out)
+
+    logger = OpLogger(memory_store=store)
+    logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)
+
+    # SIMULATE the slot-wrap race: a later recv (an o triplet, shape
+    # (16, 64)) writes the same TCM slot. MemoryStore.read for shape
+    # (16, 1) now raises ValueError("Shape mismatch ...").
+    overwrite = np.full((16, 64), 99.0, dtype=np.float16)
+    store.write("tcm", slot_addr, overwrite)
+
+    logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)
+
+    snap = logger.records[0].params["input_snapshots"][0]
+    # Today: snap is None (read raised, except branch returned None).
+    # Post-fix: handle.data preferred → snap is original.
+    assert snap is not None, (
+        "input snapshot was lost when the recv slot was wrapped — "
+        "OpLogger must prefer handle.data over MemoryStore.read for "
+        "math inputs whose handle carries a .data snapshot"
+    )
+    assert snap.shape == (16, 1)
+    np.testing.assert_array_equal(snap, original)
+
+
+def test_math_snapshot_handle_data_with_multiple_inputs():
+    """maximum/binary math has 2 inputs; both must use their carried
+    snapshots independently (e.g. m_running merged with m_from_W where
+    only m_from_W came from a recv slot)."""
+    store = MemoryStore()
+
+    # Input 0: a running m value held in PE scratch (no .data; OpLogger
+    # falls back to MemoryStore.read as today). Its addr is stable —
+    # not subject to the slot-wrap race.
+    scratch_addr = 0x5000
+    m_running = np.full((16, 1), 3.0, dtype=np.float16)
+    store.write("tcm", scratch_addr, m_running)
+    inp0 = _slot_handle(scratch_addr, (16, 1), "f16", data=None)
+
+    # Input 1: m_from_W via tl.recv — carries snapshot in .data, addr
+    # is the recv slot which WILL be wrapped before record_end.
+    slot_addr = 0x3000
+    m_from_W = np.full((16, 1), 7.0, dtype=np.float16)
+    store.write("tcm", slot_addr, m_from_W)
+    inp1 = _slot_handle(slot_addr, (16, 1), "f16", data=m_from_W.copy())
+
+    out = _out_handle(0x4000, (16, 1), "f16")
+    cmd = MathCmd(op="maximum", inputs=(inp0, inp1), out=out)
+
+    logger = OpLogger(memory_store=store)
+    logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)
+
+    # Slot 0x3000 gets wrapped by a later recv with a different shape.
+    overwrite = np.full((16, 64), 99.0, dtype=np.float16)
+    store.write("tcm", slot_addr, overwrite)
+
+    logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)
+
+    snaps = logger.records[0].params["input_snapshots"]
+    assert len(snaps) == 2
+    # Input 0 (no carried snapshot, addr stable): MemoryStore read still
+    # works. This must keep working post-fix.
+    assert snaps[0] is not None
+    np.testing.assert_array_equal(snaps[0], m_running)
+    # Input 1 (carried snapshot, slot wrapped): must come from .data.
+    assert snaps[1] is not None
+    assert snaps[1].shape == (16, 1)
+    np.testing.assert_array_equal(snaps[1], m_from_W)
+
+
+def test_math_snapshot_falls_back_to_memory_store_when_handle_data_is_none():
+    """Backward-compat: handles with .data=None must continue to use
+    MemoryStore.read as today. Most math inputs (intermediate results
+    from local tl.dot / tl.exp etc.) have data=None and their TCM addrs
+    are stable for the kernel's lifetime."""
+    store = MemoryStore()
+    addr = 0x6000
+    arr = np.full((8, 8), 2.0, dtype=np.float16)
+    store.write("tcm", addr, arr)
+
+    inp = _slot_handle(addr, (8, 8), "f16", data=None)
+    out = _out_handle(0x7000, (8, 8), "f16")
+    cmd = MathCmd(op="exp", inputs=(inp,), out=out)
+
+    logger = OpLogger(memory_store=store)
+    logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)
+    logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)
+
+    snap = logger.records[0].params["input_snapshots"][0]
+    assert snap is not None
+    np.testing.assert_array_equal(snap, arr)
@@ -0,0 +1,334 @@
+"""Generate docs/adr/INDEX.md (and docs/adr-ko/INDEX.md) from the ADR corpus.
+
+Auto-derives a section-based index following the same classification as
+the /report skill — Design Principles / High-level Architecture /
+Detailed Architecture (by component) / Implementation Decisions
+(by topic). Run before publishing to refresh INDEX.md.
+
+The classification table below is the single source of truth. When a new
+ADR is added under docs/adr/, append an entry to ``CLASSIFICATION``. The
+script exits 1 if any ADR file is missing from the table or any title
+cannot be parsed, so omissions surface in CI.
+
+Usage:
+    python tools/generate_adr_index.py [--root <repo-root>] [--check]
+
+  --check : exit 1 if the generated INDEX differs from the on-disk file
+            (used by CI to detect un-regenerated indexes).
+"""
+
+from __future__ import annotations
+
+import argparse
+import re
+import sys
+from pathlib import Path
+
+ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-([a-z0-9_-]+)\.md$")
+# Title separator may be ":" (most ADRs) or "—" (em-dash; ADR-0033 uses
+# this). The verifier (tools/verify_adr_lang_pairs.py) only checks the
+# number, so both styles already coexist in the corpus.
+TITLE_RE = re.compile(r"^# ADR-(\d{4})\s*[:—]\s*(.+?)\s*$")
+
+DESIGN_PRINCIPLES = "Design Principles"
+HIGH_LEVEL = "High-level Architecture"
+DETAILED = "Detailed Architecture"
+IMPL_DECISIONS = "Implementation Decisions"
+
+
+# (section, subgroup) per ADR. subgroup is used to sub-divide Detailed
+# (by component, see DETAILED_COMPONENTS) and Implementation (by topic).
+# Add a line here when introducing a new ADR.
+CLASSIFICATION: dict[int, tuple[str, str | None]] = {
+    # Design Principles
+    13: (DESIGN_PRINCIPLES, None),
+    33: (DESIGN_PRINCIPLES, None),
+
+    # High-level Architecture
+    3:  (HIGH_LEVEL, "System hierarchy (Tray / SIP / CUBE / PE)"),
+    7:  (HIGH_LEVEL, "Runtime API ↔ sim_engine boundaries"),
+    16: (HIGH_LEVEL, "IOChiplet NOC and memory data path"),
+    17: (HIGH_LEVEL, "Cube NOC and HBM connectivity"),
+
+    # Detailed Architecture (subgroup matches DETAILED_COMPONENTS entries)
+    14: (DETAILED, "pe_pipeline"),  # covers pe_cpu/pe_dma/pe_fetch_store/pe_gemm/pe_math/pe_scheduler
+    23: (DETAILED, "pe_ipcq"),
+    34: (DETAILED, "hbm_ctrl"),
+    35: (DETAILED, "m_cpu"),
+    36: (DETAILED, "io_cpu"),
+    37: (DETAILED, "forwarding"),
+    38: (DETAILED, "pcie_ep"),
+    39: (DETAILED, "pe_mmu"),
+    40: (DETAILED, "pe_tcm"),
+    41: (DETAILED, "sram"),
+    42: (DETAILED, "tiling"),
+
+    # Implementation Decisions
+    1:  (IMPL_DECISIONS, "Address Scheme"),
+    2:  (IMPL_DECISIONS, "Routing & Helper API"),
+    4:  (IMPL_DECISIONS, "Memory Semantics & Local-HBM Bandwidth"),
+    5:  (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
+    6:  (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
+    8:  (IMPL_DECISIONS, "Tensor Deployment and Allocation"),
+    9:  (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
+    10: (IMPL_DECISIONS, "CLI Surface and Semantics"),
+    11: (IMPL_DECISIONS, "Address Scheme"),
+    12: (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
+    15: (IMPL_DECISIONS, "Component Port/Wire Fabric Model"),
+    20: (IMPL_DECISIONS, "Two-Pass Data Execution"),
+    22: (IMPL_DECISIONS, "2D Grid Program Identity"),
+    24: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
+    25: (IMPL_DECISIONS, "IPCQ Direction Addressing"),
+    26: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
+    27: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
+    32: (IMPL_DECISIONS, "Intercube All-Reduce"),
+    43: (IMPL_DECISIONS, "Evaluation Harnesses"),
+    44: (IMPL_DECISIONS, "Evaluation Harnesses"),
+    45: (IMPL_DECISIONS, "Bench Module Contract"),
+    46: (IMPL_DECISIONS, "Kernel-side tl.* API (TLContext)"),
+    47: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
+    48: (IMPL_DECISIONS, "Memory Allocator Algorithms"),
+    49: (IMPL_DECISIONS, "Probe Subcommand"),
+    50: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
+    51: (IMPL_DECISIONS, "Routing & Helper API"),
+    52: (IMPL_DECISIONS, "Sim-engine Op Log and Memory Store Schemas"),
+    53: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
+    54: (IMPL_DECISIONS, "Evaluation Harnesses"),
+}
+
+# Canonical component order for the Detailed Architecture section.
+# Each entry: (component_name, list[ADR-numbers that cover it]).
+# Order matches src/kernbench/components/builtin/*.py alphabetical
+# (the same order /report uses).
+DETAILED_COMPONENTS: list[tuple[str, list[int]]] = [
+    ("forwarding",      [37]),
+    ("hbm_ctrl",        [34]),
+    ("io_cpu",          [36]),
+    ("m_cpu",           [35]),
+    ("pcie_ep",         [38]),
+    ("pe_cpu",          [14]),
+    ("pe_dma",          [14, 23]),
+    ("pe_fetch_store",  [14]),
+    ("pe_gemm",         [14]),
+    ("pe_ipcq",         [23]),
+    ("pe_math",         [14]),
+    ("pe_mmu",          [39]),
+    ("pe_scheduler",    [14]),
+    ("pe_tcm",          [40]),
+    ("sram",            [41]),
+    ("tiling",          [42]),
+]
+
+
+def _strip_bom(text: str) -> str:
+    """Strip leading UTF-8 BOM if present."""
+    if text.startswith("\ufeff"):
+        return text[1:]
+    return text
+
+
+def _find_adrs(adr_dir: Path) -> list[tuple[int, str, Path]]:
+    """Return [(num, slug, path), ...] for ADR files in adr_dir, sorted by num."""
+    out: list[tuple[int, str, Path]] = []
+    for p in sorted(adr_dir.iterdir()):
+        if not p.is_file():
+            continue
+        m = ADR_FILENAME_RE.match(p.name)
+        if not m:
+            continue
+        out.append((int(m.group(1)), m.group(2), p))
+    out.sort(key=lambda t: t[0])
+    return out
+
+
+def _extract_title(path: Path) -> str:
+    """Parse the title from the first line `# ADR-NNNN: <title>` (":" or "—" separator). Strips BOM."""
+    text = _strip_bom(path.read_text(encoding="utf-8"))
+    first_line = text.split("\n", 1)[0] if text else ""
+    m = TITLE_RE.match(first_line)
+    if not m:
+        raise ValueError(
+            f"{path.name}: cannot parse title from first line: {first_line!r}"
+        )
+    return m.group(2)
+
+
+def _build_index(adr_dir: Path, link_prefix: str) -> str:
+    """Build the INDEX.md text for adr_dir.
+
+    link_prefix is the relative href used for ADR links (e.g., ``./``
+    so links resolve relative to the INDEX file location).
+    """
+    adrs = _find_adrs(adr_dir)
+    if not adrs:
+        raise RuntimeError(f"No ADR files found under {adr_dir}")
+
+    # Validate every ADR is classified.
+    missing = sorted(num for num, _slug, _ in adrs if num not in CLASSIFICATION)
+    if missing:
+        raise RuntimeError(
+            "ADR(s) missing from CLASSIFICATION table in "
+            "tools/generate_adr_index.py: "
+            + ", ".join(f"ADR-{n:04d}" for n in missing)
+            + ". Add an entry for each."
+        )
+
+    # Map: num → (filename, title)
+    num_to_meta: dict[int, tuple[str, str]] = {}
+    for num, _slug, path in adrs:
+        num_to_meta[num] = (path.name, _extract_title(path))
+
+    # ── Section assembly ────────────────────────────────────────────
+    lines: list[str] = []
+    lines.append("# ADR Index")
+    lines.append("")
+    lines.append(
+        "Auto-generated by `tools/generate_adr_index.py`. "
+        f"Total ADRs: **{len(adrs)}**."
+    )
+    lines.append("")
+    lines.append(
+        "Classification mirrors the `/report` skill's section assignment. "
+        "When adding a new ADR, also add an entry to the "
+        "`CLASSIFICATION` table in `tools/generate_adr_index.py`."
+    )
+    lines.append("")
+
+    def fmt_entry(num: int) -> str:
+        fname, title = num_to_meta[num]
+        return f"- [ADR-{num:04d}]({link_prefix}{fname}) — {title}"
+
+    # Design Principles
+    lines.append("## Design Principles")
+    lines.append("")
+    nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
+                  if sec == DESIGN_PRINCIPLES and n in num_to_meta)
+    for n in nums:
+        lines.append(fmt_entry(n))
+    lines.append("")
+
+    # High-level Architecture (sorted by ADR number)
+    lines.append("## High-level Architecture")
+    lines.append("")
+    nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
+                  if sec == HIGH_LEVEL and n in num_to_meta)
+    for n in nums:
+        sub = CLASSIFICATION[n][1] or ""
+        fname, title = num_to_meta[n]
+        if sub:
+            lines.append(
+                f"- [ADR-{n:04d}]({link_prefix}{fname}) — {title}"
+                f"  _({sub})_"
+            )
+        else:
+            lines.append(fmt_entry(n))
+    lines.append("")
+
+    # Detailed Architecture (canonical component order)
+    lines.append("## Detailed Architecture")
+    lines.append("")
+    lines.append("One subsection per component file under `src/kernbench/components/builtin/`.")
+    lines.append("")
+    for comp, adr_nums in DETAILED_COMPONENTS:
+        lines.append(f"### {comp}")
+        lines.append("")
+        if adr_nums:
+            for n in adr_nums:
+                if n not in num_to_meta:
+                    raise RuntimeError(
+                        f"DETAILED_COMPONENTS references ADR-{n:04d} for "
+                        f"'{comp}' but no such ADR file exists."
+                    )
+                lines.append(fmt_entry(n))
+        else:
+            lines.append("_(no ADR coverage)_")
+        lines.append("")
+
+    # Implementation Decisions: group by subgroup; topics are ordered below by their smallest ADR number.
+    lines.append("## Implementation Decisions")
+    lines.append("")
+    topic_order: list[str] = []
+    topic_to_nums: dict[str, list[int]] = {}
+    for n, (sec, sub) in CLASSIFICATION.items():
+        if sec != IMPL_DECISIONS or n not in num_to_meta:
+            continue
+        topic = sub or "Uncategorized"
+        if topic not in topic_to_nums:
+            topic_order.append(topic)
+            topic_to_nums[topic] = []
+        topic_to_nums[topic].append(n)
+    # Stable order: by smallest ADR-number in topic, so older infra appears first.
+    topic_order.sort(key=lambda t: min(topic_to_nums[t]))
+    for topic in topic_order:
+        lines.append(f"### {topic}")
+        lines.append("")
+        for n in sorted(topic_to_nums[topic]):
+            lines.append(fmt_entry(n))
+        lines.append("")
+
+    return "\n".join(lines).rstrip() + "\n"
+
+
+def _check_or_write(path: Path, content: str, check: bool) -> bool:
+    """Write content to path, or compare in --check mode. Returns True only when --check finds a diff."""
+    existing = path.read_text(encoding="utf-8") if path.exists() else ""
+    if check:
+        if existing != content:
+            print(f"[diff] {path} would change.")
+            return True
+        return False
+    path.write_text(content, encoding="utf-8")
+    if existing != content:
+        print(f"[wrote] {path}")
+    else:
+        print(f"[unchanged] {path}")
+    return False
+
+
+def main(argv: list[str] | None = None) -> int:
+    p = argparse.ArgumentParser(description=__doc__)
+    p.add_argument(
+        "--root", type=Path, default=Path.cwd(),
+        help="Repository root (default: cwd)",
+    )
+    p.add_argument(
+        "--check", action="store_true",
+        help="Exit 1 if generated INDEX would differ from disk",
+    )
+    args = p.parse_args(argv)
+
+    en_dir = args.root / "docs" / "adr"
+    ko_dir = args.root / "docs" / "adr-ko"
+
+    if not en_dir.is_dir():
+        print(f"error: {en_dir} does not exist", file=sys.stderr)
+        return 1
+
+    any_diff = False
+    try:
+        en_index = _build_index(en_dir, link_prefix="./")
+    except (RuntimeError, ValueError) as e:
+        print(f"error (EN): {e}", file=sys.stderr)
+        return 1
+    any_diff |= _check_or_write(en_dir / "INDEX.md", en_index, args.check)
+
+    if ko_dir.is_dir():
+        try:
+            ko_index = _build_index(ko_dir, link_prefix="./")
+        except (RuntimeError, ValueError) as e:
+            print(f"error (KO): {e}", file=sys.stderr)
+            return 1
+        any_diff |= _check_or_write(ko_dir / "INDEX.md", ko_index, args.check)
+
+    if args.check and any_diff:
+        print(
+            "INDEX.md is out of date. "
+            "Run `python tools/generate_adr_index.py` to refresh.",
+            file=sys.stderr,
+        )
+        return 1
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
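
The title parsing in `_extract_title` depends on `TITLE_RE`, which is defined outside this excerpt. The sketch below is a minimal stand-alone approximation, not the tool's actual pattern: `TITLE_RE_SKETCH` and `parse_title` are hypothetical names, and the regex is an assumption inferred from the `# ADR-NNNN: <title>` docstring, the use of `m.group(2)` for the title, and the commit note that both ":" and "—" separators are accepted.

```python
import re

# Assumed pattern: the real TITLE_RE lives outside this excerpt and may differ.
# Group 1 is the ADR number, group 2 the title (matching the m.group(2) usage).
TITLE_RE_SKETCH = re.compile(r"^#\s*ADR-(\d{4})\s*[:—]\s*(.+?)\s*$")


def parse_title(text: str) -> str:
    """Return the title from a first line like '# ADR-0033 — Some title'."""
    if text.startswith("\ufeff"):  # drop a leading BOM, as _strip_bom does
        text = text[1:]
    first_line = text.split("\n", 1)[0]
    m = TITLE_RE_SKETCH.match(first_line)
    if not m:
        raise ValueError(f"cannot parse title from first line: {first_line!r}")
    return m.group(2)
```

Under these assumptions, `parse_title("# ADR-0001: Example title\nbody")` and `parse_title("\ufeff# ADR-0002 — Example title\n")` both yield `"Example title"`, while a first line without the `# ADR-NNNN` prefix raises `ValueError`.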
Author	SHA1	Message	Date
mukesh	222815d374	attention: add rank_axis kwarg to mesh kernels for multi_user cube ring ADR-0059 single_user_* panels run the ring across PEs in one cube (rank == tl.program_id(axis=0)). multi_user_* panels run the ring across cubes — rank should be cube_id (axis=1), and 7 of every 8 PEs in each cube must stay silent because the cube-level SFR install only gives the cube-coordinate PE 0 an E/W neighbor. Add ``rank_axis: int = 0`` kwarg to both ``attention_mesh_mlo_kernel`` and ``attention_mesh_kv_kernel``: - 0 (default): rank == tl.program_id(axis=0). Existing single_user behavior, all spec tests unchanged. - 1: gate ``if tl.program_id(axis=0) != 0: return`` at kernel start, then ``rank = tl.program_id(axis=1)``. multi_user_* panels pass this to the kernel via ctx.launch positional arg. Also brings in _attention_mesh_kv.py and _attention_mesh_mlo.py as the committed home of the ADR-0059 kernels (previously living uncommitted in the working tree from sub-cycle 4b). Tests: 7-test rank_axis spec file (default-path + rank_axis=1 gating and cube-id semantics, both kernels); 4-panel diag harness now green end-to-end (single_user_prefill/decode + multi_user_prefill/decode); 763-test wider sweep clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-01 19:53:18 -07:00
mukesh	d9e767d048	runtime_api: ctx.launch honors DPPolicy.num_cubes + adds _auto_dim_remap opt-out Two compounding bugs in ctx.launch's dim-translation path surfaced by multi_user_* panels of milestone-gqa-llama70b (sub-cycle 4c step 2): Bug A: _compute_local_shape divided by self._num_cubes (the topology's cube count, 16 in default topology.yaml) instead of the DPPolicy's effective num_cubes (4 for validation-scale multi_user). The tensor allocator at context.py:471-484 already honored dp.num_cubes; the parallel computation inside launch was out of sync. Fix mirrors the allocator's eff_num_cubes precedence pattern. Bug B: dim_map was keyed by value, so any scalar whose value coincidentally equaled a global tensor dim got rewritten to that dim's local value — e.g. d_head=64 colliding with K's global M=64 in multi_user mode. Legacy bench kernels (va_offset etc.) rely on this remap, so the fix is opt-out: ctx.launch(..., _auto_dim_remap=False) preserves scalars exactly as passed. Default remains True. Tests: 3 new dim-translation tests + 4-panel diag harness covers single_user_* (PASS) and multi_user_* (advances to new SFR/axis layer failure, tracked separately). va_offset + full attention spec suite unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-01 19:33:40 -07:00
mukesh	313dee503c	sim_engine: fix IPCQ slot-wrap snapshot race in Phase 2 replay Phase 1 cannot snapshot math-output sources at outbound send time because math executes only in Phase 2 — so token.data stays None and PE_DMA inbound can't write the recv slot. For own-sends this is harmless (Phase 2 replay reads the stable scratch addr after math runs). For forwarded sends in mesh kernels (ADR-0059), src_addr is a recv slot that gets wrapped by later inbounds before this read's Phase 2 turn, yielding a shape mismatch on the fallback MemoryStore.read. Fix: DataExecutor maintains a per-slot, time-ordered, shape-keyed history. Every ipcq_copy write appends (t_write, value) to the slot's history; _resolve_read falls back to the most recent shape-matching entry with t_write <= the consuming op's t_start. Applied uniformly to _execute_memory, _execute_gemm, and _execute_math. Secondary: OpLogger.record_end for math ops now prefers TensorHandle.data carried by the input handle over a MemoryStore re-read, closing the smaller record-end race covered by the new test_op_log_input_snapshot_race.py unit tests. Tests: 4 new race tests + 6 existing op_log + mesh decode diag + mesh kv/mlo spec — all green. Full repo sweep: 760 passed (3 pre-existing failures unrelated: bench-registry list drift + Windows Tkinter env). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-01 19:14:09 -07:00
mukesh	b1d6fafd3a	eval: commit milestone bench output (track generated figures + results) Per request, the milestone bench output is now tracked in git instead of gitignored, so the figures/results are viewable on the remote: - src/kernbench/benches/1H_milestone_output/gemm/ (3 PNGs + gemm_sweep.json) - src/kernbench/benches/1H_milestone_output/ccl/ (3 per-topology PNGs, buffer-kind PNG+CSV, FSIM comparison PNG, topology.png, summary.csv) Drop the .gitignore rule; update ADR-0054 D3 + Negative (EN+KO) to say the output is committed (regenerable by rerunning the bench). Artifacts produced by full bench runs (milestone-1h-gemm non-FAST, milestone-1h-ccl). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 15:37:27 -07:00
mukesh	cc1bbd0ab7	eval: fold GEMM/allreduce harnesses into self-contained milestone benches Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/ into two self-contained eval benches so a user can regenerate every result + figure with one command: kernbench run --bench milestone-1h-gemm (MILESTONE_FAST=1 reuses JSON) kernbench run --bench milestone-1h-ccl - benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the run(torch) entry drives the sweeps and writes figures into benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a sentinel tensor to satisfy the run_bench contract. - tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin re-export/wrapper shims over the benches (single source preserved); the pytest-only param builders + _run_distributed wrapper stay in the shim. - eval-bench pattern: a bench may drive many configs + build its own per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2). ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI Semantics amended; ADR INDEX regenerated. Verified: milestone benches run clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 15:19:52 -07:00
ywkang	e33e76f2d1	adr: add INDEX.md (auto-generated by tools/generate_adr_index.py) Adds a section-based table of contents for the 46-ADR corpus, mirroring the /report skill's classification (Design Principles / High-level Architecture / Detailed Architecture by component / Implementation Decisions by topic). Generated for both docs/adr/ (EN titles) and docs/adr-ko/ (KO titles) from one tool. tools/generate_adr_index.py: - Single CLASSIFICATION dict per ADR — add an entry when introducing a new ADR; the script fails loud if any file is missing from the table. - DETAILED_COMPONENTS lists each builtin component and the ADR(s) that cover it (ADR-0014 appears under six PE engines; ADR-0023 under pe_dma + pe_ipcq). - Accepts both ":" and "—" title separators (matching ADR-0033's existing format). - --check mode for CI: exits 1 if INDEX.md is stale. Also includes the docs/report/architecture-2026-1H.md generated by the prior /report write (the public-facing architecture document; 836 lines, 76 source-attribution comments). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 11:15:37 -07:00
ywkang	bd49c93703	adr: add ADR-0050-0053 — close /report's second-pass G4 candidates Documents four cross-cutting surfaces one layer deeper than the prior G4 batch: - 0050 par-ccl-algorithm-module-contract: how to author a new CCL algorithm in src/kernbench/ccl/algorithms/. Pairs with ADR-0045's bench-module contract. Pins the four required public symbols (kernel, kernel_args, TOPO_NAME_TO_KIND constants, kernel alias), the 9 + tl standardized kernel signature, the kernel_args tuple format, sip_topo_kind dispatch, and the ccl.yaml entry workflow. - 0051 lat-routing-helper-api: every public method of AddressResolver (resolve, find_m_cpu, find_pcie_ep, find_io_cpu, find_all_pcie_eps) and PathRouter (find_path, find_path_with_distance, find_mcpu_dma_path, find_memory_path, find_node_path + 2 shims). Pins the four adjacency graphs (_adj_all / _adj / _adj_mcpu_dma / _adj_local) and the edge-kind exclusion sets they use, plus the single-owner naming convention. - 0052 dev-oplog-memory-store-schemas: OpRecord's 7 fields, the per-op_name params matrix (dma_read, dma_write, gemm_*, math, math reduction, composite_gemm, ipcq_copy, unknown), snapshot timing rules (math = all inputs, dma_write = HBM-only — ADR-0027 race avoidance), TileToken stage_type capture, and MemoryStore's (space, addr) two-level dict with reference-store semantics. - 0053 dev-topology-builder-algorithms: the 6-stage compile pipeline, cube_mesh.yaml's source_hash cache and its 5 input fields, the cube NoC auto-layout algorithm (row/col placement, HBM exclusion zone, PE/M_CPU/SRAM attachment via nearest-router, UCIe N/S/E/W distribution), the node naming convention (single-owner with router.py), the edge-kind catalog, the 4 view projections, and a table of spec-field changes vs mesh regeneration. Bilingual pair verifier passes for all four EN/KO pairs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 10:52:42 -07:00
ywkang	9a02955770	adr: add ADR-0046-0049 — close G4 coverage gaps from /report Documents four cross-cutting surfaces that previously had no ADR backing, each surfaced as a G4 candidate by /report: - 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...), the two execution modes (command-list vs greenlet runner), scratch allocator semantics, dispatch-overhead model, and the kernel registry. - 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group (backend="ahbm") install path. world_size priority (algorithm > defaults > topology), the 4-step init sequence (load ccl.yaml, import algorithm module, derive world_size, install SFR + IPCQ), greenlet- local rank registry, all_reduce dispatch via _defer_wait, barrier no-op rationale, and the explicit list of unsupported dist.* APIs. - 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator free-list semantics. Offset-keyed first-fit with coalescing, the no-validation trust model for free(), HBM/TCM channel separation, page-aligned VA allocation, the page_size dual-default (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and one-allocator-per-sub-unit rule. - 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog. H2D / D2H / PE DMA categories with their exact cube-index choices, the 32 KiB reference size, the 5-point utilization sweep, the formula vs actual column meanings, automatic invariant checks (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine isolation, and the human-readable (not machine-parsable) output contract. Bilingual pair verifier passes for all four EN/KO pairs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 10:25:04 -07:00
ywkang	5f8dd688f5	adr: add ADR-0045 (bench module contract — registration, dispatch, authoring) Documents src/kernbench/benches/: how @bench registration + audit work, how the CLI dispatches via run_bench/RuntimeContext, and the contract a new bench module must satisfy. Nine decisions (D1-D9) cover: - @bench name/description rules and duplicate detection - Module-file convention (_-prefixed helpers vs bench modules) - def run(torch) signature; torch = RuntimeContext - Minimum-one-submit rule (else NO_REQUESTS) - Single-device convention + multi-SIP CCL exception (ADR-0024/0027) - resolve() name/index decision tree; indices are not a stable API - Exact RuntimeContext surface exposed to benches - Env-var parameterization (matmul_composite / gemm_sweep.py pattern) Four alternatives rejected with documented reasons (manifest YAML, decorator entry= arg, @multi_device_bench split, stable indices). Verifier (tools/verify_adr_lang_pairs.py) passes for EN/KO pair. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 16:29:45 -07:00
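
The `rank_axis` gating described in commit 222815d374 can be sketched in plain Python. This is an illustration of the selection rule only, not the kernel code: `select_rank` is a hypothetical helper, `pid_axis0` / `pid_axis1` stand in for `tl.program_id(axis=0)` / `tl.program_id(axis=1)`, and rejecting values other than 0 or 1 is an assumption the commit message does not state.

```python
from typing import Optional


def select_rank(pid_axis0: int, pid_axis1: int, rank_axis: int = 0) -> Optional[int]:
    """Return this program's ring rank, or None if it must stay silent.

    pid_axis0 / pid_axis1 stand in for tl.program_id(axis=0) / (axis=1).
    """
    if rank_axis == 0:
        # single_user_*: the ring runs across PEs in one cube.
        return pid_axis0
    if rank_axis == 1:
        # multi_user_*: only PE 0 of each cube joins the cube-level ring.
        if pid_axis0 != 0:
            return None
        return pid_axis1
    # Assumption: other values are rejected; the commit only describes 0 and 1.
    raise ValueError(f"unsupported rank_axis: {rank_axis}")


# 4 cubes x 8 PEs per cube: 7 of every 8 PEs stay silent when rank_axis=1.
ranks = [select_rank(pe, cube, rank_axis=1) for cube in range(4) for pe in range(8)]
active = [r for r in ranks if r is not None]
```

With `rank_axis=1`, `active` holds one rank per cube (the cube id), and the other 28 of the 32 programs return `None`. With the default `rank_axis=0`, every program returns its axis-0 id, which matches the existing single_user behavior the commit says is unchanged.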