attention: add rank_axis kwarg to mesh kernels for multi_user cube ring

ADR-0059 single_user_* panels run the ring across PEs in one cube (rank == tl.program_id(axis=0)). multi_user_* panels run the ring across cubes: rank should be cube_id (axis=1), and 7 of every 8 PEs in each cube must stay silent because the cube-level SFR install only gives the cube-coordinate PE 0 an E/W neighbor.

Add ``rank_axis: int = 0`` kwarg to both ``attention_mesh_mlo_kernel`` and ``attention_mesh_kv_kernel``:
- 0 (default): rank == tl.program_id(axis=0). Existing single_user behavior, all spec tests unchanged.
- 1: gate ``if tl.program_id(axis=0) != 0: return`` at kernel start, then ``rank = tl.program_id(axis=1)``.

multi_user_* panels pass this to the kernel via ctx.launch positional arg.

Also brings in _attention_mesh_kv.py and _attention_mesh_mlo.py as the committed home of the ADR-0059 kernels (previously living uncommitted in the working tree from sub-cycle 4b).

Tests: 7-test rank_axis spec file (default-path + rank_axis=1 gating and cube-id semantics, both kernels); 4-panel diag harness now green end-to-end (single_user_prefill/decode + multi_user_prefill/decode); 763-test wider sweep clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
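The gating described in the commit message can be sketched as follows. This is a minimal illustration, not the committed kernel code: `FakeTL` and `resolve_ring_rank` are hypothetical names, and the real kernels presumably inline the check and `return` early instead of returning `None`.

```python
class FakeTL:
    """Hypothetical stand-in for the kernel-side `tl` object (not the real TLContext)."""

    def __init__(self, pe_id: int, cube_id: int) -> None:
        self._pe_id = pe_id
        self._cube_id = cube_id

    def program_id(self, axis: int = 0) -> int:
        # axis 0 -> cube-local PE index, axis 1 -> cube id (as in the commit message).
        return self._pe_id if axis == 0 else self._cube_id


def resolve_ring_rank(tl, rank_axis: int = 0):
    """Return this program's ring rank, or None if the PE must stay silent.

    Hypothetical helper used only for illustration.
    """
    if rank_axis == 0:
        # single_user_* panels: the ring runs across PEs inside one cube.
        return tl.program_id(axis=0)
    if rank_axis == 1:
        # multi_user_* panels: only PE 0 of each cube joins the cube-level ring.
        if tl.program_id(axis=0) != 0:
            return None
        return tl.program_id(axis=1)
    raise ValueError(f"unsupported rank_axis: {rank_axis}")


# One cube of 8 PEs at cube_id 2: only PE 0 takes part in the cube ring.
ranks = [resolve_ring_rank(FakeTL(pe, 2), rank_axis=1) for pe in range(8)]
```

With `rank_axis=1`, seven of the eight entries in `ranks` are `None`, matching the "7 of every 8 PEs stay silent" rule.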
runtime_api: ctx.launch honors DPPolicy.num_cubes + adds _auto_dim_remap opt-out
2026-06-01 19:53:18 -07:00 · 2026-06-01 19:33:40 -07:00 · 2026-06-01 19:14:09 -07:00 · 2026-05-22 15:37:27 -07:00 · 2026-05-22 15:19:52 -07:00 · 2026-05-22 11:15:37 -07:00
56 changed files with 11807 additions and 1484 deletions
@@ -371,6 +371,13 @@ Concrete forms that Part 1's *Verification Plan* MUST take in this repo:
 - `kernbench run --device <id>` runs the benchmark on a single device.
 - Omitting `--device` runs the benchmark on all devices discovered in the topology (logically parallel).
 - Device enumeration is handled by the CLI only; benchmarks MUST remain single-device.
 - **Eval-bench exception (ADR-0054)**: a *milestone / eval bench*
  (`milestone-1h-*`) may drive many configurations and build its own
  per-config engines to regenerate a domain's full result + figure set; it
  ignores `--device` and submits a sentinel tensor to satisfy the
  "must submit ≥1 request" contract (ADR-0045 D4). This is the eval-harness
  carve-out to the single-device rule, alongside the ADR-0024 multi-SIP CCL
  exception.
 ## Derived Artifacts (Clarification)
@@ -7,6 +7,11 @@ Accepted
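 Documents the `tests/sccl/` evaluation harness; verified against the implementation
 (constants, file set, and sweep dimensions cross-checked).
 **Amended by ADR-0054**: the driver core, sweep, and renderer moved to the
 `milestone-1h-ccl` bench (single home); `tests/sccl/_allreduce_helpers.py` now
 re-exports from there (the pytest-only param builders and the `_run_distributed`
 wrapper stay local). The figure tests are unchanged.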
 ## Context
 ADR-0032 defines the intercube all-reduce *algorithm*, and ADR-0023/0024/0027
@@ -8,6 +8,12 @@ GEMM 평가/특성화 하니스를 문서화한다; 구현과 대조 검증 완
 (constants, tile sizes, figure set, and the script↔test split cross-checked). The D5/D6
 caveats are recorded limitations, not inaccuracies.
 **Amended by ADR-0054**: the sweep and renderer moved to the `milestone-1h-gemm` bench
 (single home); `scripts/gemm_sweep.py` and `tests/gemm/` now re-export from there.
 The D1/D2 rule that "data generation is a manual script / heavy work is opt-in" is
 replaced by the eval-bench pattern (one bench regenerates everything;
 `MILESTONE_FAST=1` reuses the committed JSON).
 ## Context
 For the GEMM *implementation*, ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generator)
@@ -0,0 +1,265 @@
 # ADR-0045: Bench Module Contract — registration, dispatch, and authoring
 ## Status
 Accepted (2026-05-21).
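 Consolidates the registration mechanism (@bench) of the `src/kernbench/benches/`
 package, the CLI dispatch path (`kernbench run/list`), and the contract that a new
 bench module must follow. ADR-0010 (CLI surface) specifies the `kernbench list/run`
 interface, but **how a bench is registered and which function signature it must
 follow** was not captured at the ADR level.
 **Extended by ADR-0054**: a third pattern is added to D5's single-configuration rule.
 An *eval bench* (e.g. `milestone-1h-*`) drives multiple configurations, builds its own
 engine per configuration, and submits a sentinel tensor to satisfy D4.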
 ## First action
 When the `kernbench.benches` package is imported, `__init__.py` immediately calls
 `_eager_import_and_audit(__path__, __name__)`. The first thing this function does is
 enumerate every sibling module in the package directory with
 `pkgutil.iter_modules(__path__)` and then **eagerly load**, via
 `importlib.import_module(...)`, every module that matches neither of the following
 two exclusions:
 - the name is `registry` (the infrastructure itself)
 - the name starts with `_` (helper modules)
 At import time, each `@bench(name=..., description=...)` decorator in a module runs,
 appending a `(name, description, fn)` tuple to the `_PENDING` list and adding
 `fn.__module__` to the `_REGISTERED_MODULES` set.
 Once all imports finish, `_audit_modules(imported, _REGISTERED_MODULES)` is called; if
 any **module was imported but never called @bench**, it immediately raises
 `RuntimeError("Bench module(s) missing @bench decorator: ...")`.
 When this audit passes, no indices have been assigned yet; on the first
 `list_all()` / `resolve(...)` call, `_finalize()` assigns 1-based indices in
 alphabetical order of name.
 In short, **the first job of the bench infrastructure is to import every non-helper
 module in the package directory and audit that each one called @bench at least
 once**.
 ## Context
 `src/kernbench/benches/` currently holds 8 bench modules (`ccl_allreduce`,
 `gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`, `qkv_gemm`,
 `qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows this unified flow:
 ```
 kernbench run --topology <T> --bench <N>
   ↓
 cli/main.py::cmd_run
   ↓  resolve_topology(T)  + resolve(N)  + resolve_device(device_arg)
   ↓
 runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
   ↓  engine_factory(topology, device) → GraphEngine
   ↓  RuntimeContext(engine, target_device, correlation_id, spec)
   ↓
 bench_fn(ctx)        ← the bench-defined run(torch) is called
   ↓  submits via ctx.empty/zeros/from_numpy/launch/distributed.* etc.
   ↓
 ctx.wait_all()       ← drains any outstanding handles
   ↓
 BenchResult(completion, correlation_id, trace, traces, engine)
 ```
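 ADR-0010 covers only the CLI surface (`run/list/probe/web`), and ADR-0007 covers only
 the responsibility boundary between the runtime API and sim_engine. The actual
 question of "what shape must a new bench file take" can only be traced through code
 conventions. As a result:
 - The calling convention of the @bench decorator (kebab-case name, non-empty
  description) exists only in code.
 - The bench function signature (`def run(torch)`) is effectively a convention, enforced
  only by the fact that the CLI dispatch side calls `spec.run`.
 - Authors of new benches learn that "helper modules must be split out with a `_`
  prefix" only after hitting the audit RuntimeError.
 - It is unclear how far the single-device convention (CLAUDE.md Part 2 CLI Semantics)
  applies from the bench author's point of view (is the CCL multi-SIP bench an
  exception?).
 This ADR resolves these ambiguities in one place.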
 ## Decision
 ### D1. @bench decorator contract
 ```python
 from kernbench.benches.registry import bench
 @bench(name="my-bench", description="Short, complete-sentence description.")
 def run(torch):
    ...
 ```
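 - `name`: a kebab-case string. It must match the regex `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`:
  only lowercase letters, digits, and dashes are allowed; underscores (`_`) are
  forbidden; it must start with a letter.
 - `description`: a non-empty string (length > 0 after strip). It is shown verbatim in
  the CLI `list` output.
 - The decorator **returns fn unmodified**, so the function can still be called
  directly. Its only side effect is adding the registration to `_PENDING`.
 Violating either of the first two rules raises `ValueError` immediately. A duplicate
 name is caught at `_finalize()` time as `RuntimeError("duplicate bench name: ...")`.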
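 ### D2. Module file convention
 `src/kernbench/benches/<slug>.py` must be one of the following:
 - **bench module**: `@bench(...)` runs at least once on the top-level import path and
  registers at least one bench.
 - **helper module**: the file name starts with `_` (e.g. `_shared_helpers.py`). It is
  skipped during the `iter_modules` traversal.
 The audit (`_audit_modules`) rejects any non-helper module that never calls @bench.
 The intended result: simply adding a new file to `benches/` registers it
 automatically, and the distinction from helpers is marked unambiguously by the
 **file-name prefix** alone.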
 ### D3. The bench function signature is `def run(torch)`
 The decorator does not enforce the function name, but **the CLI dispatch calls
 `spec_entry.run`** (i.e. the decorated callable). The convention is therefore:
 - Function name: `run`. Decorating a function with another name still works, but
  always use `run` for readability and grep-ability.
 - Argument: a single positional argument `torch`. It is actually a `RuntimeContext`
  instance that exposes a PyTorch-style namespace (zeros/empty/launch/distributed/...)
  (ADR-0024 D3).
 - Return value: arbitrary (`Any`). `run_bench` currently ignores the return value and
  tracks completion through `ctx.handles()` / `engine.get_completion()`.
 The name `torch` mimics the PyTorch-compatible idiom; no actual PyTorch module is
 passed in (aligned with the "rank = SIP" launcher convention of ADR-0024).
 ### D4. A bench must perform at least one submit
 When `ctx.handles()` is empty, `run_bench` returns BenchResult.completion as
 `ok=False, error_code="NO_REQUESTS"`. A meaningful bench must therefore call at least
 one of the following:
 - Tensor creation API: `torch.zeros(...)`, `torch.empty(...)`. Internally an
  `MmuMapMsg` and (for zeros) a `MemoryWriteMsg` are submitted.
 - Kernel launch API: `torch.launch(name, fn, *args)`. Submits a `KernelLaunchMsg` per
  SIP.
 - (Exception) Empty placeholder bench: a stub that only calls `print(...)`, such as
  `ipcq_allreduce.py`, receives a NO_REQUESTS result. CI is assumed to recognize it as
  a placeholder and handle it separately.
 ### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)
 The **"benchmarks MUST remain single-device"** convention stated in CLAUDE.md Part 2
 CLI Semantics is interpreted as follows:
 - **Regular bench (single-SIP)**: define tensor placement with `dp = DPPolicy(...)`
  and launch kernels with `torch.launch(...)`. The SIP index is determined by
  `--device` (the CLI's responsibility).
 - **CCL bench (multi-SIP)**: as an exception, follow the rank = SIP pattern
  (ADR-0024 D3) with `torch.distributed.init_process_group(backend="ahbm")` +
  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)`. `--device` is ignored (or
  assumed to be `all`), and each spawned worker binds its own SIP with
  `torch.ahbm.set_device(rank)`.
 Any multi-device call outside these two patterns (e.g. one bench function launching
 on several SIPs directly from the same process) is forbidden by this ADR. Even when
 the CLI is invoked with `--device all`, the bench runs only once; to handle multiple
 SIPs inside it, use D5's second pattern.
 ### D6. Name and index resolution (`resolve`)
 `resolve(identifier: str)` returns a BenchSpec in the following order:
 1. `identifier.isdigit()` → convert to an integer and return the spec among the
   `_REGISTRY` entries whose `index` equals it. If none exists,
   `ValueError("No bench with index ...")`.
 2. `identifier in _REGISTRY` → direct lookup.
 3. Otherwise → `ValueError("Unknown bench ...")`.
 An empty or whitespace-only identifier raises
 `ValueError("bench identifier must be a non-empty string.")`.
 The CLI passes the `--bench` argument to `resolve` as-is, so users can use either
 `kernbench run --bench gemm-single-pe` or `kernbench run --bench 2`.
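The resolution order in D6 can be sketched as below. The two-field `BenchSpec`, the sample registry contents, and the position of the empty-identifier check are assumptions made for illustration; only the three-step order and the error messages come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchSpec:
    # Simplified stand-in; the real BenchSpec carries more fields (e.g. run).
    index: int
    name: str


# Hypothetical registry contents, indexed alphabetically as in D7.
_REGISTRY = {
    "ccl-allreduce": BenchSpec(1, "ccl-allreduce"),
    "gemm-single-pe": BenchSpec(2, "gemm-single-pe"),
}


def resolve(identifier: str) -> BenchSpec:
    if not isinstance(identifier, str) or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    # 1. An all-digit identifier is treated as a 1-based index.
    if identifier.isdigit():
        index = int(identifier)
        for spec in _REGISTRY.values():
            if spec.index == index:
                return spec
        raise ValueError(f"No bench with index {index}")
    # 2. Otherwise try a direct name lookup.
    if identifier in _REGISTRY:
        return _REGISTRY[identifier]
    # 3. Anything else is unknown.
    raise ValueError(f"Unknown bench {identifier!r}")


by_index = resolve("2")
by_name = resolve("gemm-single-pe")
```

Both calls return the same spec, which is why the two CLI forms above are interchangeable as long as the index assignment has not shifted.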
 ### D7. Indices are not a stable API
 `_finalize()` sorts `_PENDING` **alphabetically by name** and then assigns 1-based
 indices, so adding a new bench can shift the indices of existing ones. Therefore:
 - Human-friendly interactive use: indices are fine.
 - Scripts / CI automation: always use names.
 This is stated in the `registry.py` module docstring.
 ### D8. Surface that RuntimeContext exposes to benches
 The surface a bench function may legitimately use through the `torch` parameter:
 - **Tensor creation**: `torch.empty(shape, dtype=..., dp=DPPolicy(...), name=...)`,
  `torch.zeros(...)`, `torch.from_numpy(arr)`. All of them submit host-side metadata
  plus device deployment (MmuMap + MemoryWrite).
 - **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)`. Converts
  `(Tensor, int, float)` positional arguments to `TensorArg` / `ScalarArg`, issues a
  `KernelLaunchMsg` per SIP, then drains.
 - **Synchronization**: `torch.wait(handle)`, `torch.wait_all()` (called automatically
  by run_bench).
 - **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`,
  `torch.distributed.get_world_size()`, `torch.distributed.all_reduce(t, op=...)`
  (ADR-0024/0027).
 - **Multi-process (rank=SIP)**: `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)`
  (ADR-0024 D3 / ADR-0027).
 - **Device binding**: `torch.ahbm.set_device(rank)` or
  `torch.accelerator.set_device_index(rank)` (both point to the same namespace).
 - **IPCQ install**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)` (ADR-0023 D10).
 - **Spec lookup**: `torch.spec`, the dict built by the topology builder (system,
  cube_mesh, HBM parameters, etc.). Its purpose is to let benches avoid depending on
  topology.yaml parameters directly.
 A bench must not directly access private RuntimeContext members not listed above
 (`_handles`, `_traces`, `_allocators`, etc.). This aligns with the layer-boundary
 spirit of ADR-0007: only the one direction bench → runtime API → sim_engine is allowed.
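 ### D9. Parameterization through environment variables is allowed
 Injecting bench parameters from outside, as `matmul_composite.py` does with
 `os.environ.get("MATMUL_M", ...)`, is allowed. Reasons:
 - The bench function signature is fixed to `def run(torch)` by D3, so parameters are
  hard to pass as positional/keyword arguments.
 - The environment-variable pattern is a natural hook for operational sweeps such as
  `MATMUL_VARIANT`.
 - External drivers such as `scripts/gemm_sweep.py` (ADR-0044) use this hook.
 However, if environment variables change a bench's behavior, every variable must be
 listed in the module docstring (matmul_composite.py is the example).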
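A minimal sketch of the D9 pattern follows. `MATMUL_M` is the variable named above; the `read_int_env` helper and the default of 128 are assumptions for illustration and are not taken from `matmul_composite.py`.

```python
"""Example bench-parameter module (illustrative only).

Environment variables read by this sketch:
    MATMUL_M   number of rows of the left operand (assumed default: 128)
"""
import os


def read_int_env(name: str, default: int) -> int:
    """Read an integer parameter from the environment, falling back to a default."""
    raw = os.environ.get(name)
    if raw is None or not raw.strip():
        return default
    return int(raw)  # a malformed value raises ValueError instead of being ignored


# A sweep driver would set the variable before invoking the bench.
os.environ["MATMUL_M"] = "256"
m = read_int_env("MATMUL_M", 128)
```

Listing each variable in the module docstring, as in the sketch, keeps the externally injected knobs discoverable.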
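 ## Alternatives Considered
 ### A1. Keep the bench list in an explicit manifest file (YAML)
 Rejected. The @bench decorator + audit pattern guarantees "adding a file = automatic
 registration", concentrating the new bench author's cognitive cost in one place
 (writing the file). A separate manifest carries a high drift risk on the maintenance
 side, and helper separation is already clear through the `_` prefix.
 ### A2. Take the bench function name as a decorator argument (`@bench(name=..., entry="run_xxx")`)
 Rejected. It breaks the simplicity of the dispatch side calling only `spec.run`. The
 `run` convention alone suffices; if variants are needed, register several functions
 in the same module (each decorated with @bench).
 ### A3. A separate `@multi_device_bench` decorator for CCL benches
 Rejected. The two patterns stated in D5 (single + ADR-0024 multi-SIP) can express all
 8 current benches. A separate decorator would force branching on the dispatch side
 and add complexity, and the multi-SIP intent is already evident from the
 `init_process_group(...)` call in the bench function body.
 ### A4. Make indices a stable API (registration order / explicit index= argument)
 Rejected. This is the trade-off stated in D7: user-friendliness comes first
 (alphabetically sorted indices read naturally as 1, 2, 3... in the list output).
 Scripts can simply specify names.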
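 ## Consequences
 - "How to add a bench" is captured in one ADR, so new authors only need to follow
  D1-D3 and D8 without grepping the code.
 - The pattern of splitting helper modules with a `_` prefix is justified at the ADR
  level, freeing future shared helpers of the form `benches/_*.py`.
 - D5 states that the single-device convention of CLAUDE.md Part 2 CLI Semantics does
  not conflict with multi-SIP CCL benches: the two are orthogonal.
 - The rationale (D9) for `scripts/gemm_sweep.py` of ADR-0044 (GEMM eval harness) using
  the environment-variable hook is fixed in this ADR.
 - Index instability (D7) is stated explicitly, so code such as
  `kernbench run --bench 3` on the CI side should be reviewed right after this ADR is
  accepted.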
@@ -0,0 +1,307 @@
 # ADR-0046: TLContext — Kernel-side `tl.*` API Contract
 ## Status
 Accepted (2026-05-22).
 Specifies the set of `tl.*` primitives exposed by `TLContext` in
 `src/kernbench/triton_emu/`, their semantics, and the contract of the two execution
 modes (command-list / greenlet runner). ADR-0014/0020 define the PE pipeline and the
 2-pass execution model, but **the `tl.*` surface that a bench's kernel function
 calls** had not been captured at the ADR level.
 ## First action
 When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id, num_cubes,
 scratch_base, scratch_size)` is constructed, it first initializes the following six
 groups of fields:
 - `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes`: the values
  returned by `tl.program_id` / `tl.num_programs`.
 - `self._dispatch_cycles`: the cycle count of the `PeCpuOverheadCmd(cycles)` issued
  automatically at the start of every `tl.*` API call.
 - `self._runner`: the `KernelRunner` instance (greenlet mode if present, command-list
  mode if absent).
 - `self._commands: list[PeCommand] = []`: the command sequence accumulated in
  command-list mode.
 - `self._handle_counter = 0`, `self._completion_counter = 0`: used to generate new
  TensorHandle / CompletionHandle ids.
 - `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0`: the PE-local
  scratch region (for allocating output handle addresses of math/dot/composite).
 In short, **the first job of TLContext is to fill in the metadata describing where
 (sip/cube/pe) and at what scale (num_programs/num_cubes) this kernel instance runs,
 and in which mode (runner present or not) it issues commands**. At this point no
 SimPy event exists and no command has been issued.
 The first runtime action happens when the kernel function first calls `tl.<api>()`.
 The standard entry behavior of every `tl.*` API is:
 1. Call `self._emit_dispatch_overhead()`: when `dispatch_cycles > 0`, immediately
   `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
 2. API-specific processing (create the TensorHandle, build the command).
 3. `self._emit(cmd)`: in runner mode, pass cmd to the SimPy side via greenlet.switch;
   otherwise append it to `self._commands`.
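The three-step entry behavior can be illustrated with a toy context running in command-list mode. `MiniTLContext` and the field layout of the two command classes are assumptions; only the names `PeCpuOverheadCmd`, `DmaReadCmd`, `_emit`, `_emit_dispatch_overhead`, and `switch_to_simpy` come from this ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeCpuOverheadCmd:
    cycles: int


@dataclass(frozen=True)
class DmaReadCmd:
    # Assumed minimal fields; the real command likely carries more.
    ptr: int
    shape: tuple


class MiniTLContext:
    """Toy model of the entry sequence (not the real TLContext)."""

    def __init__(self, dispatch_cycles: int = 1, runner=None) -> None:
        self._dispatch_cycles = dispatch_cycles
        self._runner = runner
        self._commands = []

    def _emit(self, cmd):
        if self._runner is not None:
            # greenlet runner mode: hand the command to the SimPy side.
            return self._runner.switch_to_simpy(cmd)
        # command-list mode: just record the command for later replay.
        self._commands.append(cmd)
        return None

    def _emit_dispatch_overhead(self) -> None:
        if self._dispatch_cycles > 0:
            self._emit(PeCpuOverheadCmd(self._dispatch_cycles))

    def load(self, ptr: int, shape) -> None:
        self._emit_dispatch_overhead()        # step 1
        cmd = DmaReadCmd(ptr, tuple(shape))   # step 2
        self._emit(cmd)                       # step 3


tl = MiniTLContext(dispatch_cycles=1)
tl.load(0x1000, (4, 4))
```

After the call, `tl._commands` holds the overhead command followed by the DMA read; with `dispatch_cycles=0` only the DMA read is recorded.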
 ## Context
 The `tl.*` surface consists of the methods exposed by `TLContext`, and the `tl`
 parameter that a kernel function receives is this object. The contract visible to the
 user (bench author) covers:
 - which primitives exist
 - what data flow each primitive triggers (DMA / compute / IPCQ / metadata-only)
 - how the `space` and `addr` of a TensorHandle are determined
 - the difference between command-list mode and greenlet mode
 ADR-0014 (PE pipeline) defines the PeCommands that PE_SCHEDULER receives, but how
 `tl.*` emits them exists only as code convention. Likewise, ADR-0020 (2-pass data
 execution) mentions the existence of greenlet mode in D3, but the signature
 difference between the runner and non-runner paths (return-value handling) is not
 stated at the ADR level. This ADR fills that gap.
 ## Decision
 ### D1. The `tl` parameter is a `TLContext` instance
 A bench's kernel function follows this signature:
 ```python
 def _kernel(arg1, arg2, ..., tl, **kwargs):
    ...
 ```
 `tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name mimics real
 Triton's `triton.language` module; no real Triton module is passed in.
 A kernel function is a plain Python function with no `yield` / `async`. `tl.*` calls
 trigger SimPy events, but from the caller (kernel) side they look like synchronous
 calls, because in greenlet mode KernelRunner relays between SimPy and the kernel
 (ADR-0020 D3).
 ### D2. Two execution modes: command-list / greenlet runner
 - **command-list mode (`runner is None`)**: `tl.*` calls accumulate PeCommands in the
  `self._commands` list. DMA / GEMM / Math consume no actual SimPy time, and the
  return value is a metadata-only TensorHandle (data=None). PE_SCHEDULER / sim_engine
  later replays the command sequence in time.
 - **greenlet runner mode (`runner is not None`)**: `tl.*` calls hand control to the
  parent greenlet (SimPy) through `self._emit(cmd)` → `runner.switch_to_simpy(cmd)`.
  The parent dispatches cmd to the components, consumes SimPy time, and, for a DMA
  read, returns the actual numpy data. The kernel receives that result and proceeds
  to the next line (the data-aware execution model of ADR-0020 D3).
 The mode is selected by whether a KernelRunner instance is injected into TLContext;
 the `tl.*` methods are unaware of the difference and behave consistently through the
 `_emit()` helper.
 ### D3. Primitive categories
 #### D3.1. Reference (no DMA, metadata only)
 - `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: creates only a handle that
  references HBM data and issues no DMA. Used when the composite scheduler streams
  per tile (e.g. the b operand of GEMM).
 #### D3.2. Data movement (blocking, DMA engine)
 - `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → result handle. Issues a
  `DmaReadCmd`. In greenlet mode the actual numpy array is attached to the result
  handle's `.data`; in command-list mode it is a placeholder. The returned handle has
  `space="hbm"`, `pinned=True`.
 - `tl.store(ptr, handle) -> None`: TCM → HBM. Issues a `DmaWriteCmd`. In greenlet mode
  it first calls `_store.write("hbm", ptr, data)`, only when `handle.data` is present
  (visibility = issue time, ADR-0020 D3).
 #### D3.3. GEMM / compute (blocking)
 - `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must be in TCM, with shapes
  (M,K) × (K,N) → (M,N). Issues a `GemmCmd`; the output handle is allocated in PE-local
  scratch by `_make_compute_out(shape, dtype)`.
 - `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
  acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking tiled pipeline.
  Issues a `CompositeCmd`. `epilogue` is a list of dicts, each with an `"op"` key,
  op-specific fields, and an optional `"scope"` (k_tile / output_tile); an unknown op
  or a missing field raises ValueError immediately. The returned CompletionHandle is
  synchronized with `tl.wait(h)`.
 #### D3.4. Math: unary (blocking)
 - `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
  `tl.cos(x)`, `tl.sin(x)`: each issues `MathCmd(op=<name>, inputs=(x,), out=)`.
  `out` is a scratch allocation with the same shape/dtype.
 #### D3.5. Math: binary (blocking)
 - `tl.maximum(a, b)`, `tl.minimum(a, b)`: via `_binary_math`.
 - `tl.fma(a, b, c)`: `a*b + c`. Three inputs.
 - `tl.clamp(x, min, max)`: `MathCmd(op="clamp", inputs=(x, min, max))`.
 - `tl.where(cond, a, b)`: `MathCmd(op="where", inputs=(cond, a, b))`.
 - `tl.softmax(x, axis=-1)`: a single MathCmd(op="softmax"), so timing is accounted
  once. The Phase 2 DataExecutor expands it into the canonical sequence
  (x-max → exp → sum → div).
 #### D3.6. Reduction (blocking)
 - `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)`: return an output handle
  with the size of the given axis reduced to 1. Issues `MathCmd(op=<name>, inputs=(x,),
  out=, axis=axis)`.
 #### D3.7. Index / scalar (PE_CPU, no engine)
 - `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE index),
  `axis==1` → cube_id (ADR-0022).
 - `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per cube),
  `axis==1` → num_cubes.
 - `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range in TCM.
  Metadata only, no command issued.
 - `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape, value,
  dtype="f16") -> TensorHandle`: placeholders in TCM. No command issued.
 #### D3.8. Scalar helpers (no command, no engine)
 - `TLContext.cdiv(a, b) -> int` (static): ceiling division
  `-(-a // b)`. Mimics real Triton's `tl.cdiv`.
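The ceiling-division identity used by `cdiv` relies on Python's floor division rounding toward negative infinity. A standalone sketch (the tile sizes are illustrative values, not taken from any kernel):

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive b: floor-divide the negated value, negate back."""
    return -(-a // b)


# Typical use: how many tiles of 128 elements cover 1000 elements?
num_tiles = cdiv(1000, 128)
```

An exact multiple yields no extra tile, so `cdiv(1024, 128)` stays at the plain quotient.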
 #### D3.9. Metadata-only (no compute, no DMA)
 - `tl.trans(x) -> TensorHandle`: a new handle with the last two dims of the shape
  swapped. Shares the same addr/data; no command issued.
 #### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
 - `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
  dtype="f16", space="tcm") -> None`: blocking send. Accepts either the handle form
  or the raw-address form. Issues an `IpcqSendCmd`. When a snapshot of the handle's
  `.data` is carried in the command, it avoids a race on the data received by the
  recv side.
 - `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
  dst_space=None) -> TensorHandle`: blocking recv. If both `dst_addr` and `dst_space`
  are given, it runs in "copy_to_dst" mode; otherwise in "return_slot" mode. In
  greenlet mode the actual data is attached to the handle's `.data`.
 - `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
  **DIAGNOSTIC ONLY**. The blocking recv synchronization still applies, but the
  slot-read latency (slot-IO + PE↔bank fabric drain) is skipped. Intended for
  apples-to-apples comparison with `tl.store` in the pe2pe overview plot. Production
  kernels must not use it; it is isolated in a separate command branch
  (`consume=False`), so an accidentally set flag cannot trigger it.
 - `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking recv.
  Returns a `RecvFuture`; the result is collected later with `tl.wait(future)`.
 #### D3.11. Composite + control
 - `tl.composite(...)`: described in D3.3.
 - `tl.wait(handle=None)`: waits on a `CompletionHandle` (composite), a `RecvFuture`
  (async recv), or `None` (all pending composites).
 - `tl.cycles(n)`: explicitly declares PE_CPU scalar execution overhead. Issues
  `PeCpuOverheadCmd(cycles=n)`.
 ### D4. TensorHandle arithmetic operators: thread-local TLContext
 When the `tl_context.py` module loads, `_enable_tensor_ops()` is called and patches
 `TensorHandle.__add__`, `__sub__`, `__mul__`, and `__truediv__`. Each operator calls
 `_binary_math` on the active TLContext stored in the thread-local `_ctx` (a module
 variable).
 Inside a kernel, `c = a + b` is therefore equivalent to issuing
 `MathCmd(op="add", inputs=(a,b), out=)` and returning a new TensorHandle.
 Active TLContext management:
 - `TLContext._set_active(ctx)`: sets the active ctx of the current thread/greenlet.
 - `TLContext._get_active()`: lookup (RuntimeError if none).
 - `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets the active ctx on
  entry, runs the kernel, and restores it to None on exit.
 On every cmd dispatch, `KernelRunner`'s `_switch_kernel` calls `_set_active(tl)`
 directly, so the active ctx is restored even when another PE runner in the same
 thread has overwritten it.
 ### D5. Scratch allocator: compute output handles
 Ops that produce a result, such as `tl.dot`, `tl.exp`, and `tl.add` (TensorHandle
 `__add__`), call `_make_compute_out(shape, dtype)` to allocate a 16-byte-aligned
 scratch address. This address is issued with `space="tcm"`, and `tl.send` /
 `tl.store` can later use the handle as a source.
 When `_scratch_base == 0` (e.g. command-list mode), the allocated address is returned
 as 0 and the handle cannot be used as a send/store source (in that case only handles
 obtained from `tl.load` can be a source).
 A RuntimeError is raised when the cursor exceeds `_scratch_size` (default 1 MiB).
 The cursor must be reset to 0 at the start of every kernel invocation; in the current
 code this happens naturally because KernelRunner creates a new TLContext each time.
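A bump allocator with the properties listed in D5 might look like the sketch below. `ScratchAllocator` is a hypothetical name, and two details are assumptions: alignment is applied to the cursor offset (so the base is taken to be aligned already), and exhaustion is checked before the cursor advances.

```python
class ScratchAllocator:
    """Toy 16-byte-aligned bump allocator (not the real _make_compute_out)."""

    ALIGN = 16

    def __init__(self, base: int, size: int = 1 << 20) -> None:
        self._base = base      # 0 means "no scratch region configured"
        self._size = size      # default 1 MiB, as stated in D5
        self._cursor = 0

    def alloc(self, nbytes: int) -> int:
        if self._base == 0:
            # Without a scratch region the address is reported as 0.
            return 0
        # Round the cursor up to the next multiple of 16.
        start = (self._cursor + self.ALIGN - 1) & ~(self.ALIGN - 1)
        end = start + nbytes
        if end > self._size:
            raise RuntimeError("scratch region exhausted")
        self._cursor = end
        return self._base + start


scratch = ScratchAllocator(base=0x1000, size=64)
first = scratch.alloc(10)   # occupies offsets 0..9
second = scratch.alloc(4)   # next aligned offset is 16
```

A fresh allocator per kernel invocation gives the cursor reset for free, mirroring how a new TLContext is created each time.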
 ### D6. Dispatch overhead: `PeCpuOverheadCmd(dispatch_cycles)`
 `_emit_dispatch_overhead()` is called at the entry of every non-metadata `tl.*` call
 and issues `PeCpuOverheadCmd(dispatch_cycles)` when `dispatch_cycles > 0`. It models
 the cycle cost that PE_CPU spends on command dispatch itself.
 Defaults:
 - Default of the `dispatch_cycles` parameter of `TLContext.__init__`: 1 cycle.
 - TLContext created by `KernelRunner`: 0 cycles (greenlet mode accounts for cycles
  separately, in the spirit of ADR-0020 D3).
 ### D7. Kernel registry (`triton_emu/registry.py`)
 A separate `_kernels: dict[str, Callable]` holds the kernel name → function mapping:
 - `register_kernel(name, fn)`: ValueError on duplicate registration.
 - `get_kernel(name)`: KeyError if not registered.
 - `clear_registry()`: test-only.
 `RuntimeContext.launch(kernel_name, kernel_fn, *args)` idempotently overwrites
 `_kernels[kernel_name] = kernel_fn` on every call (last call wins). This is
 consistent with the launch behavior of ADR-0045 D8.
 PE_CPU looks up the kernel function in the registry by `KernelRef.name` and then runs
 it with KernelRunner.
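 ## Alternatives Considered
 ### A1. Fold tl.* into ADR-0014 / ADR-0020
 Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption of PeCommands)
 and ADR-0020 covers 2-pass execution (Phase 1 timing / Phase 2 data). `tl.*` is the
 API surface that kernel authors meet, so keeping it separate is better for
 searchability and onboarding.
 ### A2. Deprecate command-list mode
 Rejected (for now). Command-list mode is lightweight for simple unit tests and kernel
 verification. It is kept as an entry point for inspecting PeCommand sequences without
 a greenlet dependency. When the semantics specific to greenlet mode (real data,
 Phase 2) are needed, they are entered explicitly through D2's mode selection.
 ### A3. Remove the TensorHandle arithmetic operators
 Rejected. They mimic the readability of real Triton kernel code (e.g. `c = a + b`),
 and the thread-local active ctx pattern works cleanly. An explicit `tl.add(a, b)` is
 also exposed in D3.5, so the operators can be replaced with functional calls if they
 are confusing.
 ### A4. Expand softmax into an explicit sequence (max → exp → sum → div)
 Partially adopted. `tl.softmax` handles timing accounting at once as a single
 `MathCmd(op="softmax")` (D3.5). The real-data expansion is unrolled into the canonical
 sequence by the Phase 2 DataExecutor. That is, the timing model is atomic and the data
 model is expanded; the two concerns are separated on purpose.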
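 ## Consequences
 - Every `tl.*` primitive that a bench author meets is classified and defined in one
  ADR. Paired with the host-side surface of ADR-0045 D8 (torch.empty etc.), this
  completes the authoring guide for both "inside / outside the kernel".
 - The difference between the command-list and greenlet modes is stated in D2, so a new
  `tl.*` primitive is automatically compatible with both as long as it follows the
  `_emit()` pattern.
 - The thread-local active ctx pattern (D4) is justified at the ADR level, making it
  clear where the reset responsibility lies for future multi-PE same-thread execution
  (`_switch_kernel` restores the active ctx on cmd dispatch, which is the contract of
  KernelRunner.run).
 - The diagnostic-only isolation of `tl.recv_no_consume` (D3.10) is fixed in the ADR,
  making clear which layer prevents accidental use in production kernels.
 - The registry (D7) is split out as its own D item, specifying the behavior for
  kernel-name collisions and dynamic re-registration.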
@@ -0,0 +1,243 @@
 # ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
 ## Status
 Accepted (2026-05-22).
 Specifies what `AhbmCCLBackend` + `DistributedContext` in
 `runtime_api/distributed.py`, i.e. the
 `torch.distributed.init_process_group(backend="ahbm")` entry point, actually install
 and with what semantics they implement `all_reduce`/`barrier`/`get_rank` and so on.
 ADR-0023 D11 mentions the "torch.distributed compatibility" intent, but **the
 behavioral model of the backend itself** was missing at the ADR level.
 ## First action
 `RuntimeContext.__post_init__` automatically creates a `DistributedContext()` instance
 and attaches it to `self.distributed`. The first actions at that point are:
 1. Initialize `self._backend: AhbmCCLBackend | None = None` (not yet initialized).
 2. Initialize the greenlet-local rank registry as `self._rank_by_greenlet: dict = {}`
   (ADR-0024 D2).
 3. On the caller (RuntimeContext) side, plant a back-reference with
   `dc._ctx_ref = self` so that `init_process_group` can later reach `ctx.engine` /
   `ctx.spec` / `ctx.launch`.
 In short, **the first job of DistributedContext is to attach itself to RuntimeContext
 with a back-reference and leave the backend slot empty**. The actual backend install
 (IPCQ install, world_size resolution, algorithm module load) happens only when user
 code calls `torch.distributed.init_process_group(backend="ahbm")`.
 At that point, the first actions of `init_process_group` are:
 1. If `backend != "ahbm"`, immediately raise `ValueError("Unsupported backend ...")`.
 2. If `getattr(self, "_ctx_ref", None)` is None, raise
   `RuntimeError("DistributedContext not bound to a RuntimeContext")`.
 3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)`: inside this constructor the
   ccl.yaml load, algorithm module import, world_size resolution, SFR configuration,
   and IPCQ install all take place.
 4. `self._backend._dist_ctx = self`: lets the backend reach back into
   `_rank_by_greenlet`.
 ## Context
 The purpose of `AhbmCCLBackend` is to make the collective calls of PyTorch DDP
 (`init_process_group`, `all_reduce`, etc.) usable as-is, so that bench code looks the
 same as a "real DDP training script" (aligned with the launcher model of ADR-0024 +
 ADR-0027).
 This backend is responsible for:
 - **Installing the IPCQ neighbor table once** at `init_process_group` time (similar to
  real NCCL communicator creation).
 - **Launching the kernel function of the configured algorithm via `ctx.launch(...)`**
  when `all_reduce(tensor, op="sum")` is called.
 - Answering `get_world_size` / `get_rank` consistently from the greenlet-local rank
  registry and ccl.yaml/topology.
 ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) partially cover this, but
 **the scope of responsibility and decision order of `AhbmCCLBackend` itself** is
 stated nowhere. This ADR fills that gap.
 ## Decision
 ### D1. The backend is created only at `init_process_group(backend="ahbm")` time
 `DistributedContext` starts with `_backend = None` at `__init__` time. The backend
 object does not exist until the user calls `dist.init_process_group(backend="ahbm")`,
 and if any other API (`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) is
 called while the backend is None, it raises
 `RuntimeError("Default process group has not been initialized...")` (the
 `_ensure_initialized` helper).
 `backend != "ahbm"` raises `ValueError` immediately. Other backend names (nccl, gloo,
 etc.) are not recognized.
 ### D2. world_size resolution priority: algorithm > defaults > topology
 Decision order of `AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):
 1. If the algorithm entry in `ccl.yaml` has `world_size`, use that value.
 2. If `defaults.world_size` exists, use that value.
 3. If neither exists, use `spec.system.sips.count` (= the number of SIPs in the
   topology).
 The default meaning is **rank = SIP** (ADR-0024). Cube/PE-level parallelism is
 expressed with DPPolicy inside each rank and does not affect world_size. An explicit
 world_size override in `ccl.yaml` is honored as-is for the legacy "rank = flat PE
 index" test path.
 The user arguments of `init_process_group(world_size=..., rank=...)` are **accepted
 but ignored** (the same meaning as the `RANK` / `WORLD_SIZE` env vars of real
 PyTorch).
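The priority chain in D2 reduces to three lookups. In this sketch the plain-dict representation of the algorithm entry, the defaults section, and the topology spec is an assumption; the real `_resolve_world_size` is a method with access to the backend's merged config.

```python
def resolve_world_size(algo_cfg: dict, defaults: dict, spec: dict) -> int:
    """algorithm entry > defaults > number of SIPs in the topology."""
    if "world_size" in algo_cfg:
        return int(algo_cfg["world_size"])
    if "world_size" in defaults:
        return int(defaults["world_size"])
    # Fallback: rank = SIP, so world_size is the SIP count.
    return int(spec["system"]["sips"]["count"])


# Hypothetical topology with 4 SIPs.
spec = {"system": {"sips": {"count": 4}}}

from_algo = resolve_world_size({"world_size": 16}, {"world_size": 8}, spec)
from_defaults = resolve_world_size({}, {"world_size": 8}, spec)
from_topology = resolve_world_size({}, {}, spec)
```

Only the last case follows the default rank = SIP meaning; the first two are the explicit overrides kept for the legacy test path.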
 ### D3. The five install steps that `init_process_group` performs immediately
 The following run sequentially inside `AhbmCCLBackend.__init__`:
 1. **ccl.yaml loading**: `kernbench.ccl.install.load_ccl_config()` →
   `resolve_algorithm_config(_cfg_all)` produces the merged config of
   `defaults.algorithm` (or the user-specified algorithm).
 2. **Algorithm module import**: `importlib.import_module(self._merged["module"])`.
   This module must expose a `kernel` function, a
   `kernel_args(world_size, n_elem, cube_w, cube_h)` helper, and an optional
   `TOPO_NAME_TO_KIND` mapping.
 3. **world_size resolution** (D2).
 4. **Topology metadata collection**: from `spec`, gather `n_sips`, `sip_topo`
   (default `ring_1d`), `cube_w`/`cube_h`, and `sips.w`/`sips.h`. If the SIP topology
   is not ring_1d, derive `_sip_topo_w/h` from the explicit `w`/`h` or from the square
   root (guaranteeing `w*h == n_sips`). A mismatch raises `ValueError`.
 5. **SFR + IPCQ install**: call
   `kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine, spec,
   self._merged)`. This function pushes the IPCQ neighbor table to pe0 of every
   SIP/cube (corresponding to the one-time setup of a real NCCL communicator).
 If this order changes (e.g. the algorithm module load fails before SFR), a
 partially initialized state can occur. D3 is therefore treated as one atomic
 five-step sequence: on failure the backend remains uninstalled.
 ### D4. Greenlet-local rank registration (ADR-0024 D2)
 `DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps a rank to each
 spawned worker greenlet. When the bench launcher (e.g.
 `torch.multiprocessing.spawn`) starts a worker, it registers it by calling
 `dc._bind_rank(g, rank)`.
 `get_rank()` looks up the greenlet from `getcurrent()`. An unregistered greenlet
 returns 0 as a fallback, preserving single-driver / test compatibility.
 The backend obtains the current greenlet's rank during `all_reduce` through
 `_dist_ctx._rank_by_greenlet` (D5).
 ### D5. `all_reduce(tensor, op="sum")` behavior
 Validation:
 - `op != "sum"` → `NotImplementedError`. The current kernels implement only add
  reduction.
 - `tensor._handle is None` → `RuntimeError("not deployed")`.
 - `tensor._handle.shards` is empty → `RuntimeError("no shards")`.
 Preparation:
 - `n_elem = shards[0].nbytes // tensor.itemsize`: the element count of a single shard.
 - `kernel_fn = self._algo_module.kernel`: the entry function of the algorithm module
  imported in D3.
 - Effective cube dims: if the first SIP has exactly 1 cube, treat it as scalar
  (1,1); otherwise use the topology's `cube_w`/`cube_h`. This naturally absorbs the
  case where TP uses only some cubes.
 - `kernel_args = self._algo_module.kernel_args(world_size, n_elem, cube_w,
  cube_h)`: the algorithm decides the argument set passed to its own kernel.
 Dispatch:
 - Look up the current greenlet's rank with `_rank_by_greenlet.get(g, 0)`.
 - Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`.
 - `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor, *kernel_args,
  *extra_args, _defer_wait=True)`: `_defer_wait=True` delegates the collective drain
  to the main scheduler (ADR-0027 D0.4).
 Drain:
 - If a parent greenlet is alive (multi-greenlet mode), enqueue into
  `_pending_collective_handles` and switch to the parent. The main scheduler drains
  everything at once after all ranks have launched.
 - In single-driver mode, drain inline immediately with
  `for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
 ### D6. `barrier()` is a no-op (single-driver model)
 kernbench handles all ranks as greenlets inside a single Python process. Since no
 situation requires inter-process synchronization, `barrier()` is callable but
 performs no actual synchronization. It is kept for API compatibility with real
 PyTorch DDP (so that callers do not receive NotImplementedError).
 If a multi-process kernbench (e.g. an independent SimPy event loop per process) is
 introduced in the future, a new ADR superseding D6 will be needed.
 ### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`
 - `get_rank()` (D4): the bound rank of the current greenlet. 0 if unregistered.
 - `get_world_size()` (D2): the world_size resolved by the backend in D3.
 - `get_backend()`: always the string `"ahbm"`. If the backend object does not exist,
  `_ensure_initialized` raises RuntimeError.
 Difference from real PyTorch:
 - Real PyTorch's `get_rank()` is a process-global value, whereas kernbench's is
  greenlet-local. Called inside a spawned worker it returns that worker's rank; called
  from the main thread it returns 0. Bench authors should expect a meaningful rank
  only inside the worker function.
 ### D8. Supported API surface (final)
 APIs exposed by `DistributedContext`:
 - `init_process_group(backend="ahbm", world_size=None, rank=None, **kwargs)`
 - `is_initialized() -> bool`
 - `get_world_size() -> int`
 - `get_rank() -> int`
 - `get_backend() -> str`
 - `all_reduce(tensor, op="sum") -> None`
 - `barrier() -> None`
 - (internal) `_bind_rank(g, rank)`
 Other PyTorch distributed APIs (broadcast, reduce, all_gather, gather, scatter,
 send/recv, etc.) are **not implemented yet**. At the kernel level they can be
 expressed directly with `tl.send`/`tl.recv` (ADR-0046 D3.10), but they are not exposed
 on the dist.* surface. When additional collectives are needed, extend D8 by adding
 a separate algorithm module plus a matching `DistributedContext` method.
 ## Alternatives Considered
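 ### A1. Create the backend eagerly in `RuntimeContext.__init__`
 Rejected. If ccl.yaml is missing or the algorithm module cannot be imported,
 RuntimeContext construction itself would fail even for benches that never use
 distributed features. Installing only at call time (D1) is the correct lazy
 semantics.
 ### A2. Always derive world_size from the topology (no override)
 Rejected. The "explicit override" path of ADR-0024 D1 is still used by legacy
 tests. It is kept for diagnostic scenarios that need to define PE-level ranks
 separately within a single SIP.
 ### A3. Treat `op != "sum"` as a silent fallback
 Rejected. If the user intended `op="prod"` / `"max"` / `"avg"` and sum ran
 silently, the result would be very hard to validate. An explicit
 `NotImplementedError` is safer.
 ### A4. Implement `barrier` as a SimPy event
 Rejected (for now). Cross-process synchronization has no meaning in the
 single-driver model, so a no-op is the semantically accurate choice. A SimPy
 fake barrier would add code complexity without adding meaning. To be
 re-evaluated when a multi-process kernbench is introduced.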
 ## Consequences
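 - The four-step install of `torch.distributed.init_process_group(backend="ahbm")`
  (D3) is fixed at the ADR level, so it is clear where future collective
  algorithms should hook in.
 - The priority order in D2 (algorithm > defaults > topology) is explicit, so
  the scope of impact of a ccl.yaml change can be estimated quickly.
 - The barrier no-op decision in D6 is fixed at the ADR level, making it clear
  that a multi-process kernbench must supersede it with a separate ADR.
 - The list of unsupported APIs in D8 is explicit, which gives a clear basis for
  rejecting a call such as `dist.broadcast(...)`.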
@@ -0,0 +1,262 @@
 # ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
 ## Status
 Accepted (2026-05-22).
 This ADR specifies the free-list algorithm, page alignment, and coalescing
 rules used by `_FreeList` / `PEMemAllocator` in `policy/address/allocator.py`
 and by `VirtualAllocator` in `va_allocator.py`. ADR-0001 (PhysAddr layout) and
 ADR-0011 (PA/VA/LA model) define the address scheme, but the **allocation
 algorithm** had no ADR of its own.
 ## First action
 ### `_FreeList(capacity)`
 On construction it initializes `self._capacity = capacity`, `self._used = 0`,
 and `self._free = [(0, capacity)]`. Its first action is to **set up the whole
 region as a single free block**: the tuple `(offset=0, size=capacity)` is the
 only element of the free list.
 ### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
 On construction it creates two `_FreeList` instances:
 - `self._hbm = _FreeList(cfg.hbm_slice_bytes)`: sized to the HBM slice owned by
  this PE (`hbm_bytes_per_cube // hbm_slices_per_cube`).
 - `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)`: sized to `tcm_bytes_per_pe -
  tcm_scheduler_reserved_bytes` (the scheduler reservation is carved out up front).
 So the first action of PEMemAllocator is to **set up this PE's HBM slice and its
 user TCM region as one free block each**.
 ### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
 On construction it sets `self._va_base = va_base`, `self._va_size = va_size`,
 `self._page_size = page_size`, `self._used = 0`, and `self._free = [(va_base,
 va_size)]`. Its first action is to **set up the range from the VA base through
 va_size as a single block and record page_size**.
 ## Context
 `runtime_api/context.py::_ensure_allocators` builds the allocator set in the
 following steps:
 1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`, the SIP
   range per target_device, and so on from the spec.
 2. Pack all parameters into a frozen `AddressConfig`.
 3. Create one `PEMemAllocator(sip, cube, pe, cfg)` instance for every
   combination of target SIP range × cube × PE.
 4. Create one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64 GiB,
   page_size=pe_mmu.page_size)`.
 Responsibilities of the allocators:
 - **PEMemAllocator**: PA-space allocation of the PE-local HBM slice / TCM
  (including the PhysAddr encoding).
 - **VirtualAllocator**: page-aligned allocation of the device-wide VA space.
  Afterwards `RuntimeContext._create_tensor` pushes the VA → PA mapping to the
  fabric as a `MmuMapMsg`.
 These algorithms:
 - are deliberately simple **first-fit**.
 - keep the free-block list **sorted by start offset**.
 - **coalesce with both adjacent blocks** on `free()`.
 The rationale for these decisions is not written down anywhere, so a reference
 is needed to answer future questions such as "why not best-fit", "why not a
 buddy allocator", or "why is a partial-overlap free silently allowed". This ADR
 provides that reference.
 ## Decision
 ### D1. `_FreeList`: offset-based first-fit + coalescing
 `policy/address/allocator.py::_FreeList`:
 - Internal representation: `list[tuple[int, int]]` = `[(start_offset, size), ...]`,
  a list of free blocks sorted by start offset.
 - `alloc(nbytes)`:
  1. Walk the free list from the front (first-fit).
  2. Carve the request from the front of the first block with `size >= nbytes`.
  3. On an exact match remove the whole block; otherwise shrink it to
     `(start+nbytes, size-nbytes)`.
  4. `_used += nbytes`, then return the carved `start`.
  5. If no block fits, raise `AllocationError("overflow ... largest free block
     ...")`.
 - `free(offset, nbytes)`:
  1. `_used -= nbytes`.
  2. Find the insertion position with `bisect_left(self._free, (offset,))`.
  3. If adjacent to the preceding block (`prev_start + prev_size == offset`), absorb it.
  4. If adjacent to the following block (`offset+nbytes == next_start`), absorb it.
  5. Insert the coalesced range at its sorted position.
 This algorithm is weaker against fragmentation than best-fit or buddy
 allocation, but the design assumption is that it is sufficient for this
 simulator's workloads, whose deploy/free pattern is almost stack-like. If the
 workload changes, D1 is a candidate for supersession.
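The `alloc` / `free` steps listed in D1 can be sketched as follows. This is a minimal illustration written from the description above, not the actual `_FreeList` source: the class name `FreeListSketch`, its attribute names, and the exact error message text are assumptions.

```python
from bisect import bisect_left


class AllocationError(Exception):
    """Raised when no free block is large enough (name taken from the ADR)."""


class FreeListSketch:
    """Hypothetical offset-sorted first-fit free list with coalescing."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        self.free_blocks = [(0, capacity)]  # (start, size), sorted by start

    def alloc(self, nbytes: int) -> int:
        # First-fit: take the front of the first block that is large enough.
        for i, (start, size) in enumerate(self.free_blocks):
            if size >= nbytes:
                if size == nbytes:
                    del self.free_blocks[i]
                else:
                    self.free_blocks[i] = (start + nbytes, size - nbytes)
                self.used += nbytes
                return start
        largest = max((size for _, size in self.free_blocks), default=0)
        raise AllocationError(
            f"overflow: requested {nbytes}, largest free block {largest}"
        )

    def free(self, offset: int, nbytes: int) -> None:
        # No validation of (offset, nbytes): the caller is trusted (see D2).
        self.used -= nbytes
        i = bisect_left(self.free_blocks, (offset,))
        start, size = offset, nbytes
        # Absorb the preceding block if it ends exactly where this one starts.
        if i > 0:
            prev_start, prev_size = self.free_blocks[i - 1]
            if prev_start + prev_size == start:
                start, size = prev_start, prev_size + size
                i -= 1
                del self.free_blocks[i]
        # Absorb the following block if it starts exactly where this one ends.
        if i < len(self.free_blocks):
            next_start, next_size = self.free_blocks[i]
            if start + size == next_start:
                size += next_size
                del self.free_blocks[i]
        self.free_blocks.insert(i, (start, size))
```

Freeing every allocation, in any order, should collapse the list back to the single initial block, which is the property the coalescing step exists to provide.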
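 ### D2. Partial-overlap frees are **not checked**
 `_FreeList.free(offset, nbytes)` trusts the caller to pass the exact
 (offset, nbytes). It does not verify:
 - that the range was actually allocated.
 - that the range does not overlap another allocated region.
 Reason: in the simulator context the caller always stores the return value of
 `alloc()` and passes it back to `free()` unchanged, and it is never external
 user input. Adding safety checks would cost O(N) on every free and affect
 simulation wall-clock time.
 If this trust model breaks (e.g. a code path is introduced where two tensors
 point at the same PA), it must be revisited at the ADR level immediately.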
 ### D3. `PEMemAllocator`: separate HBM and TCM channels
 `PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList` instances:
 - `_hbm`: sized `cfg.hbm_slice_bytes`.
 - `_tcm`: sized `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
  tcm_scheduler_reserved_bytes`).
 `alloc_hbm(nbytes) -> PhysAddr`:
 - Obtains an offset from `_hbm.alloc(nbytes)`.
 - Encodes the PA with `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id, pe_local_hbm_offset=offset,
  slice_size_bytes=cfg.hbm_slice_bytes)`.
 - On failure raises `AllocationError("HBM overflow ...")`.
 `free_hbm(pa, nbytes)`:
 - Recovers the PE-local offset as `pa.hbm_offset - pe_id * cfg.hbm_slice_bytes`.
 - Calls `_hbm.free(offset, nbytes)`.
 `alloc_tcm(nbytes) -> PhysAddr`: analogous, encoding with `PhysAddr.pe_tcm_addr`.
 `free_tcm(pa, nbytes)`: uses `pa.sub_offset` as-is (for TCM the PE-local offset
 is the sub_offset).
 The allocator is unaware of the scheduler-reserved TCM region
 (`cfg.tcm_scheduler_reserved_bytes`); it is subtracted from the capacity of
 `_tcm` in advance. This is consistent with the PE_SCHEDULER internal buffer
 reservation in ADR-0014.
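The capacity arithmetic and the `free_hbm` offset recovery in D3 can be checked with a small sketch. `AddressConfigSketch` and the two helper functions are hypothetical stand-ins; only the field names and the formulas come from the text above, and the real `AddressConfig` / `PhysAddr` classes may be shaped differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddressConfigSketch:
    """Hypothetical subset of AddressConfig, just enough for the D3 formulas."""

    hbm_bytes_per_cube: int
    hbm_slices_per_cube: int
    tcm_bytes_per_pe: int
    tcm_scheduler_reserved_bytes: int

    @property
    def hbm_slice_bytes(self) -> int:
        return self.hbm_bytes_per_cube // self.hbm_slices_per_cube

    @property
    def tcm_allocatable_bytes(self) -> int:
        return self.tcm_bytes_per_pe - self.tcm_scheduler_reserved_bytes


def to_cube_hbm_offset(cfg: AddressConfigSketch, pe_id: int, pe_local: int) -> int:
    # Assumed encoding: slices are laid out back to back inside the cube's HBM.
    return pe_id * cfg.hbm_slice_bytes + pe_local


def to_pe_local_offset(cfg: AddressConfigSketch, pe_id: int, hbm_offset: int) -> int:
    # Inverse step described for free_hbm: hbm_offset - pe_id * hbm_slice_bytes.
    return hbm_offset - pe_id * cfg.hbm_slice_bytes
```

The TCM sizes used in a call such as `AddressConfigSketch(48 << 30, 8, 2 << 20, 256 << 10)` are illustrative values only; the document does not state the real TCM capacity.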
 ### D4. `VirtualAllocator`: page-aligned first-fit + coalescing
 `policy/address/va_allocator.py::VirtualAllocator`:
 - Internal representation: the same sorted `list[tuple[int, int]]` as `_FreeList`.
  Initially `[(va_base, va_size)]`.
 - `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
 - `alloc(nbytes) -> int`:
  1. `aligned = _align_up(nbytes)`.
  2. First-fit search for a block with `size >= aligned`.
  3. Carve `aligned` bytes from the front of the block. Remove it on an exact match.
  4. `_used += aligned`. Return the block `start` (= the aligned VA).
  5. On failure raise `VaAllocationError`.
 - `free(va, nbytes)`: frees in units of `_align_up(nbytes)`, using the same
  coalescing algorithm as `_FreeList`.
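As a small worked example of the `_align_up` rule above, the following sketch rounds a request up to a whole number of pages using integer arithmetic only. The free function `align_up` is a hypothetical stand-in for the private method; taking `page_size` as an explicit argument is an assumption made to keep the example self-contained.

```python
def align_up(nbytes: int, page_size: int) -> int:
    """Round nbytes up to the next multiple of page_size (integer-only ceil)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # -(-a // b) is ceil(a / b) without going through floating point.
    return -(-nbytes // page_size) * page_size


MIB = 1024 * 1024

# A 3 MiB request occupies two 2 MiB pages but 768 pages of 4 KiB.
pages_2m = align_up(3 * MIB, 2 * MIB) // (2 * MIB)
pages_4k = align_up(3 * MIB, 4096) // 4096
```

Because `alloc` and `free` both round with the same rule, a caller that frees with the original unaligned `nbytes` still returns the full aligned range.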
 The actual value of `page_size` has a different default in two places:
 - The parameter default of `VirtualAllocator.__init__`: `2 MiB`. Tests that
  call the constructor directly receive this value.
 - When `RuntimeContext._ensure_allocators` instantiates it:
  `pe_mmu.attrs.get("page_size", 4096)`, i.e. the value of
  `pe_mmu.attrs.page_size` in `topology.yaml` if present, otherwise a 4 KiB fallback.
 Why the two defaults differ: the standalone default of VirtualAllocator matches
 the PE_MMU stopgap default of ADR-0039 (2 MiB), which keeps direct tests
 natural, while the 4 KiB context fallback is a safe minimum page when the
 topology does not set one. The real usage path is always the latter (because
 `_ensure_allocators` does the instantiation), and when `page_size` is set in
 `topology.yaml` that value flows consistently to both sides (MMU + VA
 allocator).
 If this agreement breaks (e.g. VirtualAllocator is instantiated with a
 page_size different from PE_MMU), MMU `map()` falls into sub-page region mode
 (ADR-0039 D3).
 Default VA range: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64 GiB`. These
 values are hardcoded in `_ensure_allocators` and carry no direct meaning in the
 VA model of ADR-0011. They simply reserve a device-wide address space large
 enough not to collide with host code.
 ### D5. Lifecycle of allocator instances
 - `RuntimeContext._ensure_allocators` is called lazily (on the first call to
  `_create_tensor`).
 - Once created, the allocator dict (`self._allocators`) is reused for the
  lifetime of the RuntimeContext. A second deploy in the same process does not
  create new objects.
 - `RuntimeContext.cleanup()` calls `_free_tensor()` on every living tensor →
  MMU unmap + `va_allocator.free` + `pemem_allocator.free_hbm` restore the free
  lists to their original state. The next RuntimeContext builds its allocators
  again from the initial state.
 Because allocator state is not shared between RuntimeContexts, consecutive
 deploy → cleanup → deploy runs in a single process are deterministic.
 ### D6. Allocator failures raise (no silent OOM)
 Both `_FreeList.alloc` and `VirtualAllocator.alloc` raise `AllocationError` /
 `VaAllocationError` when no free block is large enough. The message includes
 "required size + largest available block", which makes it possible to tell
 fragmentation apart from a genuine OOM.
 A silent fallback (e.g. allocating only as much as the largest block) is
 strictly forbidden: if a partially allocated tensor reaches the SimPy stage,
 routing and DMA see a wrong PA and simulation accuracy breaks.
 ### D7. One-to-one mapping between address spaces and allocators
 Physical address space separation is expressed through the PhysAddr sub-unit
 (ADR-0001 D2.3), and each sub-unit gets its own allocator instance:
 - HBM slice → `PEMemAllocator._hbm`.
 - PE TCM → `PEMemAllocator._tcm`.
 - (currently unused) M_CPU local memory, CUBE SRAM → need separate allocators.
  The current implementation still handles them as IPCQ-only slots
  (ADR-0023 D9.7) and does not share the PA space, so there is no separate free list.
 When a cube-level SRAM allocator becomes necessary, add a
 `_FreeList(cfg.sram_bytes_per_cube)` instance per cube (`cfg.sram_bytes_per_cube`
 is already defined in `AddressConfig`, so the data model is ready).
 ## Alternatives Considered
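 ### A1. best-fit / buddy allocator
 Rejected (for now). The workload's alloc/free pattern is stack-like (deploy
 order = free order), so first-fit + coalescing keeps fragmentation well under
 control. If long-running fragmentation is observed in LLM kernel sweeps, a
 separate ADR will replace it with a buddy allocator.
 ### A2. Add partial-overlap free validation
 Rejected, given the trust model in D2 and the O(N) cost of the check. An option
 to enable it in a debug mode (a `KERNBENCH_DEBUG` env var or similar) remains
 possible as follow-up work.
 ### A3. A unified allocator for VA and PA
 Rejected. The VA space (64 GiB device-wide) and the PA space (~6 GiB per slice)
 differ in what they mean. VA is the host kernel's view; PA is the device
 sub-unit's view. The allocators are kept separate to stay consistent with the
 spirit of the VA model in ADR-0011 (the MMU maps between the two).
 ### A4. Multi-tier page_size support (large page + small page)
 Rejected (for now). A single page_size (currently 2 MiB) fits the tensor sizes
 of LLM kernels (a few MiB to a few GiB), and the sub-page region of ADR-0039 D3
 absorbs the cases that need small mappings. Multi-tier pages would require
 extending the MMU model itself, so they are a candidate for a separate ADR.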
 ## Consequences
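 - The allocator algorithms are fixed at the ADR level (D1, D3, D4), so when a
  fragmentation issue appears in a new simulation scenario it is clear that
  first-fit + coalescing is what is in use.
 - The trust model of D2 is explicit, so the need to supersede this ADR is
  recognized early if a path that takes alloc/free directly from user input is
  introduced.
 - The one-to-one allocator mapping per sub-unit in D7 is explicit, so it is
  clear where to add a free list when a separate M_CPU/SRAM region is needed.
 - D4 records that the page_size of `VirtualAllocator` must match the PE_MMU
  setting, so the interaction with the ADR-0039 stopgap behavior can be assessed
  quickly when page_size changes in topology.yaml.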
@@ -0,0 +1,231 @@
 # ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
 ## Status
 Accepted (2026-05-22).
 This ADR specifies the traffic-pattern catalog exposed by `run_probe(...)` in
 `probes/probe.py`, the formula vs actual comparison, and the meaning of
 invariant checks such as monotonicity and D2H≥H2D. ADR-0010 (CLI surface)
 enumerates the `kernbench probe` subcommand, but **what probe actually
 measures** and **which invariants are judged PASS/FAIL** were not recorded at
 the ADR level.
 ## First action
 The first four steps of `run_probe(topology_path, case_filter=None)`:
 1. Compute the absolute path with `Path(topology_path).expanduser().resolve()`.
 2. `load_topology(path)` → a `TopologyGraph` instance (graph + spec).
 3. `_build_edge_map(graph)` → a fast `{(src, dst): Edge}` lookup table.
 4. Instantiate `AddressResolver(graph)` and `PathRouter(graph)`.
 It then sets `nbytes = 32768` (= 32 KiB, the reference data size of the summary
 table) and `show_all = (case_filter is None or case_filter == "all")`.
 In other words, **the first action of probe is to load the topology once,
 prepare the edge map / resolver / router, and fix 32 KiB as the standard
 measurement size**. After that, the cases of the three categories
 H2D → D2H → PE DMA each run in a separate `GraphEngine` instance (which blocks
 cross-talk between cases).
 ## Context
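 `kernbench probe` is a verification tool introduced for the following purposes:
 - **Ground truth for manual analysis**: when the latency from a real simulation
  (`kernbench run --bench ...`) looks abnormal, obtain the answer for a simple
  traffic pattern separately and compare.
 - **Formula vs actual comparison**: check whether the analytical model (wire
  latency + overhead + drain) matches the simulation result (`total_ns`). A
  mismatch is a clue to which simplifying assumption of the model (ADR-0033) is
  missing something.
 - **Monotonicity check**: automatically verify the invariant that latency must
  increase monotonically as the hop count grows.
 - **Utilization sweep**: a table of BW utilization per data size (4 KiB ~ 1 MiB).
 Without an ADR-level specification of this tool's behavior:
 - Anyone adding another kind of traffic pattern (e.g. MCpuDma, IPCQ) will find
  it hard to follow the table format and measurement units of the existing
  categories consistently.
 - It is ambiguous what monotonicity is checked against (hop count? cube
  distance? wire length?).
 - The meaning of the 32 KiB reference size and of the `[4 KiB, 16 KiB, 64 KiB,
  256 KiB, 1 MiB]` sweep can only be confirmed by grepping the code.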
 ## Decision
 ### D1. Three case categories: H2D / D2H / PE DMA
 Each category has a distinct data path in the topology and is printed as its
 own summary table + sweep table + route detail block.
 - **H2D (Host→Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
  dst_pe=0, pattern="zero")` flows along `pcie_ep → io_cpu → m_cpu → hbm_ctrl`.
  The hop count grows with the cube index:
  - h2d-1hop: cube=0, hops=1
  - h2d-2hop: cube=4, hops=2
  - h2d-3hop: cube=8, hops=3
  - h2d-4hop: cube=12, hops=4
 - **D2H (Device→Host Read)**: `MemoryReadMsg(src_sip=0, src_cube, src_pe=0)`.
  The latency is the sum of the forward command path and the reverse data path.
  It uses the same four hop categories.
 - **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe, dst_pa)`.
  Five cases vary the cube/PE position:
  - pe-local-hbm: same cube, same PE
  - pe-same-half-hbm: same cube, different PE (PE 1)
  - pe-cross-half-hbm: same cube, far PE (PE 4)
  - pe-cross-cube-hbm-best: adjacent cube (cube 1)
  - pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
 The indices 4/8/12 (H2D cubes) and 1/4/15 (PE DMA: PE 1, PE 4, cube 1, cube 15)
 are meaningful because of how distance is defined on the 4x4 cube mesh
 (sip.cube_mesh.w=4, h=4). If the cube_mesh size changes later, these values
 must be updated with it.
 ### D2. Standard measurement size: `nbytes = 32768` (32 KiB)
 The summary table of every case shows the result of a single run with
 `nbytes=32768`. Why 32 KiB was chosen:
 - It is a moderate size at which neither DMA overhead nor BW drain dominates.
 - It is comparable to the single-transfer unit of many sub-units (TCM,
  register file).
 How utilization changes with size is shown by a separate sweep table (D3).
 ### D3. Utilization sweep: `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
 `SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`, `SWEEP_LABELS =
 ["4KB", "16KB", "64KB", "256KB", "1MB"]`. The following formula is applied for
 each size:
 ```
 drain   = nbytes / bottleneck_bw
 total   = overhead + wire + drain
 eff_bw  = nbytes / total
 util%   = eff_bw / bottleneck_bw × 100
 ```
 If `bn_bw is None or <= 0`, that column is printed as 0.0 %. The point of the
 sweep is to show in one table that, as the hop count grows, small transfers
 become overhead-bound while large transfers become drain-bound.
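The sweep formula can be exercised with a few lines of Python. This is a sketch under stated assumptions: the function name `utilization_pct` and its parameters are hypothetical, bandwidth is taken to be in GB/s with 1 GB/s treated as 1 byte/ns (so `nbytes / bw` is already in nanoseconds), and the 0.0 % fallback mirrors the rule described above.

```python
from typing import Optional

SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]


def utilization_pct(
    nbytes: int,
    overhead_ns: float,
    wire_ns: float,
    bottleneck_bw_gbs: Optional[float],
) -> float:
    """Single-flow utilization for one transfer size, per the sweep formula."""
    if bottleneck_bw_gbs is None or bottleneck_bw_gbs <= 0:
        return 0.0  # no usable bottleneck bandwidth: column is reported as 0.0 %
    drain_ns = nbytes / bottleneck_bw_gbs  # assumes 1 GB/s == 1 byte/ns
    total_ns = overhead_ns + wire_ns + drain_ns
    eff_bw = nbytes / total_ns
    return eff_bw / bottleneck_bw_gbs * 100.0


# Illustrative inputs only (not measured values): 64 ns overhead, 64 GB/s bottleneck.
sweep = [utilization_pct(n, 64.0, 0.0, 64.0) for n in SWEEP_SIZES]
```

With a fixed overhead the utilization simplifies to `drain / (overhead + wire + drain)`, so it rises toward 100 % as the transfer grows, which is the overhead-bound to drain-bound transition the sweep table is meant to expose.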
 ### D4. Measured items: actual / formula / breakdown
 Columns shown in each case row:
 - `Actual` (total_ns): `trace["total_ns"]` from the SimPy run.
 - `Ovhd`: sum of `node.attrs["overhead_ns"]` over all nodes on the path
  (formula breakdown).
 - `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
 - `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
 - `Ovhd%` / `Drain%`: the share of Actual taken by Ovhd/Drain (the Wire term of
  the formula is usually very small and is not shown).
 - `Eff.BW`: `nbytes / total_ns` (the measured BW).
 - `BN.BW`: bottleneck bandwidth (formula), the minimum BW over all edges on the
  path. "-" if no edge has a BW.
 - `Util%`: `Eff.BW / BN.BW × 100`. 100% means the single-stream BW upper bound
  has been reached.
 A large gap between the formula sum (`wire + ovhd + drain`) and actual signals
 that something is not captured by the model simplifications (a check of the
 ADR-0033 assumptions).
 ### D5. Automatic invariant checks: PASS/FAIL
 The following invariants are checked automatically and printed as `[v] PASS` /
 `[x] FAIL`:
 - **H2D / D2H monotonic increase**: actual latency must increase monotonically
  as the hop count grows. `all(lats[i] < lats[i+1] for ...)`.
 - **D2H ≥ H2D**: at the same hop index D2H ≥ H2D (because D2H has two legs,
  forward command + reverse data). `all(d2h[i].total >= h2d[i].total)`.
 - **PE DMA best < worst**: cross-cube best (adjacent) latency < cross-cube
  worst (diagonal) latency.
 - **PE DMA local vs remote**: prints a comparison of local BN BW vs remote BN BW
  (informational, not PASS/FAIL).
 When a check FAILs, it is printed clearly on a single line so that a person
 notices the model/topology regression immediately.
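The three PASS/FAIL invariants reduce to short list comparisons. The helper names below are hypothetical and the latency numbers in the usage lines are made up for illustration; only the comparison rules and the `[v] PASS` / `[x] FAIL` prefixes come from the text above.

```python
from typing import Sequence


def strictly_increasing(latencies: Sequence[float]) -> bool:
    """Monotonicity check: every hop count must be slower than the previous one."""
    return all(a < b for a, b in zip(latencies, latencies[1:]))


def d2h_not_faster(h2d: Sequence[float], d2h: Sequence[float]) -> bool:
    """D2H >= H2D at each hop index (D2H pays command + data legs)."""
    return all(d >= h for h, d in zip(h2d, d2h))


def verdict(name: str, ok: bool) -> str:
    """One greppable line per invariant."""
    return f"{'[v] PASS' if ok else '[x] FAIL'} {name}"


# Illustrative numbers only, ordered 1hop..4hop.
h2d_ns = [510.0, 620.0, 730.0, 840.0]
d2h_ns = [530.0, 650.0, 770.0, 890.0]
lines = [
    verdict("H2D monotonic increase", strictly_increasing(h2d_ns)),
    verdict("D2H monotonic increase", strictly_increasing(d2h_ns)),
    verdict("D2H >= H2D", d2h_not_faster(h2d_ns, d2h_ns)),
]
```

Note that `strictly_increasing` is vacuously true for a single data point, so a caller that filters down to one case would need to skip the check rather than print a misleading PASS.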
 ### D6. Route detail — per-hop timestamp trace
 After the summary and sweep tables, the path of each case and its per-hop
 cumulative time (`_hop_timestamps`) are printed in a separate section:
 - H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
  (`m_cpu → hbm_ctrl`) + per-hop trace.
 - D2H: forward (cmd, no data) and reverse (data) traces shown separately.
 - PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
 The timestamp of each hop is the running total of `wire_ns + overhead_ns`. The
 annotation of the terminal hop carries `drain:Xns`. The bottleneck edge is
 marked `<BN:XXGB/s>` so that it can be identified visually.
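 ### D7. Meaning of the case_filter argument
 - `None` or `"all"`: run every case (default).
 - Any other string: run only the case whose name matches it exactly. Example:
  `kernbench probe --case h2d-2hop`.
 Within each category a case is skipped when `name != case_filter`, and that
 category's monotonicity / D2H≥H2D comparisons are naturally skipped when only
 one data point remains.
 The `--case` default of the CLI parser is `"all"`, so omitting the argument
 runs everything.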
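The filter rule can be written as a single predicate. The function `select_cases` and the hard-coded name list are hypothetical; the `show_all` expression and the exact-match rule follow the description above.

```python
from typing import List, Optional

# Case names as listed in D1 (H2D only, for brevity).
H2D_CASES = ["h2d-1hop", "h2d-2hop", "h2d-3hop", "h2d-4hop"]


def select_cases(names: List[str], case_filter: Optional[str] = None) -> List[str]:
    """Return the cases that would run for a given case_filter value."""
    show_all = case_filter is None or case_filter == "all"
    # Exact string match only: no prefix or glob matching is implied.
    return [name for name in names if show_all or name == case_filter]
```

A filter that matches nothing in a category simply yields an empty list for that category, which is why the cross-case comparisons have to tolerate fewer than two data points.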
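 ### D8. A fresh GraphEngine per case
 The 4 H2D, 4 D2H, and 5 PE DMA cases each run in a **new GraphEngine**
 instance (`engine = GraphEngine(graph)`). Reasons:
 - To isolate accumulated state between cases (op_log, completion tracking,
  allocators, etc.) so that it does not cross-talk.
 - To guarantee that the traffic of one case does not affect the BW measurement
  of another.
 This isolation allows probe results to be read as the latency of **each case
 as a standalone single flow**. Measuring multi-flow contention is the job of
 other tools (e.g. the `pe2pe_overview` plot, the multi-flow merging model of
 ADR-0033).
 ### D9. Stability of the output format
 The stdout output of probe is meant for human reading; exact column widths,
 separators, and whitespace are not a machine-readable contract. For automated
 tools to parse probe results, a separate JSON output mode would have to be
 added (not implemented yet).
 The `[v]` / `[x]` prefixes on PASS/FAIL lines are guaranteed stable as anchors
 for CI grep.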
 ## Alternatives Considered
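 ### A1. Register probe as a separate bench (`@bench(name="probe")`)
 Rejected. probe is intended as a verification tool, not a bench. Multi-engine
 execution for sweeps and analysis plus invariant PASS/FAIL output are its
 essence, which does not fit the "single device + single RuntimeContext" bench
 model of ADR-0045.
 ### A2. Exit code 1 on a monotonicity violation
 Rejected (for now). The tool is aimed at human inspection, so it prints the
 PASS/FAIL lines and exits 0. If CI should fail on a violation, a separate
 wrapper can decide based on the result of `grep "\[x\]"`. A strict-mode flag
 (`--strict`) can be introduced as a follow-up.
 ### A3. Move probe case definitions to external YAML
 Rejected (for now). The 13 cases (4 H2D + 4 D2H + 5 PE DMA) are hardcoded and
 their meaning is tightly bound to the topology mesh structure. Moving them to
 external YAML would require documenting the meaning of the indices
 (4, 8, 12 / 1, 4, 15) separately, which costs cohesion. If cases start being
 added frequently, this can be introduced through a separate ADR.
 ### A4. Add multi-flow contention measurement
 Rejected (out of scope for probe). The single-flow isolation model stated in D8
 is the core intent of probe. Multi-flow contention belongs to a different part
 of the ADR-0033 latency model and is handled by a separate tool or a separate
 case category.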
 ## Consequences
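 - The case catalog (D1) and measurement units (D2/D3) of probe are stated at
  the ADR level, so it is clear which table format a new traffic category must
  follow.
 - The column semantics of formula vs actual (D4) are fixed, so a question such
  as "why is Drain% 5% or 70%" can quickly be tied to a check of the ADR-0033
  assumptions.
 - The automatic invariant checks (D5) are fixed in the ADR, establishing probe
  as a safety net that catches monotonicity / D2H≥H2D regressions immediately
  when the latency model changes.
 - The isolation between cases in D8 is explicit, so probe results can safely
  be read as single-flow measurements. It is clear that multi-flow measurement,
  if needed, requires a separate tool track.
 - The strict-mode flag of A2 is recorded as a follow-up candidate, so it can be
  introduced with minimal extra work when CI integration is requested.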
@@ -0,0 +1,308 @@
 # ADR-0050: CCL Algorithm Module Contract — `ccl/algorithms/*.py`
 ## Status
 Accepted (2026-05-22).
 This ADR specifies the interface and kernel signature a module in the
 `src/kernbench/ccl/algorithms/` directory must provide to be used as a
 collective algorithm by the AHBM CCL backend (ADR-0047), and the procedure for
 adding a new algorithm. ADR-0047 D3 mentions in a single line that "an
 algorithm module must expose `kernel`, `kernel_args`, and optionally
 `TOPO_NAME_TO_KIND`", but **the contract an algorithm module author must
 follow** has never been written up at the ADR level. This ADR is the
 counterpart of ADR-0045, which covers the bench module contract.
 ## First action
 An algorithm module is imported at two points:
 1. **Entering the AHBM backend**: when user code calls
   `dist.init_process_group(backend="ahbm")`, `AhbmCCLBackend.__init__` runs
   `self._algo_module = importlib.import_module(self._merged["module"])`.
   At that moment the first things that happen at module level are:
   - Integer constants such as `SIP_TOPO_RING/TORUS/MESH` appear in the module namespace.
   - The `TOPO_NAME_TO_KIND` dict appears in the module namespace; the backend
     looks it up with
     `topo_map = getattr(self._algo_module, "TOPO_NAME_TO_KIND", None)`.
   - The `kernel_args` function is defined, for the caller to use at call time.
   - Algorithm functions such as `allreduce_intercube_multidevice` are defined.
   - The last line of the module exposes the alias
     `kernel = allreduce_intercube_multidevice`.
 2. **ccl.yaml install stage**: when `kernbench.ccl.install.install_ipcq` is
   called to push the IPCQ neighbor table, the same algorithm module is imported.
 In other words, **the first action of an algorithm module is to expose the
 topology-kind constants, the `TOPO_NAME_TO_KIND` dict, the `kernel_args`
 function, and the `kernel` alias in its module namespace**. Import-time side
 effects are sufficient for all of this, and no separate initialization call is
 needed.
 ## Context
 At process group initialization, `AhbmCCLBackend` (ADR-0047) obtains a module
 path from `defaults.algorithm` in `ccl.yaml` (or from the algorithm name the
 user specified) and imports it dynamically. The backend expects the following
 four things from that module:
 - `kernel`: the entry function of the collective.
 - `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: the bundle of
  positional arguments to pass to the kernel.
 - `TOPO_NAME_TO_KIND` (optional): a dict mapping the `sips.topology` string of
  `topology.yaml` (e.g. `"ring_1d"`, `"torus_2d"`, `"mesh_2d_no_wrap"`) to an
  integer constant.
 - (indirect) IPCQ neighbor table installation: `configure_sfr_intercube_multisip`
  decides the SFR from the algorithm module's `TOPO_NAME_TO_KIND` and `cube_w/h`.
 The only algorithm module in the current corpus is
 `lrab_hierarchical_allreduce.py` (248 lines). The name stands for
 "**l**eft-**r**ight **a**lternating **b**roadcast **hierarchical allreduce**".
 Whenever modules such as `ring_allreduce`, `tree_allreduce`, or `broadcast` are
 added later, they must follow this contract for dispatch to stay consistent.
 Without this contract at the ADR level:
 - A new algorithm author has to infer the signature from the single line in
  ADR-0047 D3.
 - The meaning of the kernel argument order (in particular `t_ptr, n_elem,
  cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, tl`)
  is not clear without grepping the code.
 - Which arguments `kernel_args` takes and which tuple it must return is fixed
  only by convention.
 ## Decision
 ### D1. An algorithm module exposes four public symbols
 ```python
 # src/kernbench/ccl/algorithms/<name>.py
 from __future__ import annotations
 # (required) topology-kind constants, used inside the algorithm
 SIP_TOPO_RING = 0
 SIP_TOPO_TORUS = 1
 SIP_TOPO_MESH = 2
 # (optional) topology name -> kind mapping. The backend uses it to convert the
 # string SIP topology from ccl.yaml/topology into an integer.
 TOPO_NAME_TO_KIND = {
     "ring_1d": SIP_TOPO_RING,
     "torus_2d": SIP_TOPO_TORUS,
     "mesh_2d_no_wrap": SIP_TOPO_MESH,
 }
 # (required) kernel argument builder
 def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
     return (n_elem, cube_w, cube_h, world_size)
 # (required) kernel function (TLContext is injected through the `tl=...` keyword)
 def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                         sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
     ...
 # (required) kernel alias: the backend accesses it as `module.kernel`
 kernel = my_allreduce_kernel
 ```
 - The `kernel` alias is the entry point the backend calls directly. Whatever
  the function is named (e.g. `allreduce_intercube_multidevice`), it must be
  exposed as `module.kernel = fn`.
 - Without `kernel_args` the backend has no way to build the algorithm
  arguments. See D2 for the signature.
 - Without `TOPO_NAME_TO_KIND` the backend falls back to `sip_topo_kind = 0`.
  It may be omitted by an algorithm that supports only a single topology.
 ### D2. `kernel_args` signature: `(world_size, n_elem, *, cube_w, cube_h)`
 ```python
 def kernel_args(world_size: int, n_elem: int, *,
                cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)
 ```
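 - **Positional arguments**: `world_size` (= number of ranks), `n_elem` (= number
  of elements in a single shard, in f16 terms).
 - **Keyword arguments**: `cube_w`, `cube_h` (= cube mesh size). The default is
  4×4, consistent with the default of `sip.cube_mesh` in `topology.yaml`.
 - **Return value**: a tuple packed in the order of the kernel's positional arguments.
 When the backend's `all_reduce` makes the call: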
 ```python
 kernel_args_tuple = self._algo_module.kernel_args(
    self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
 )
 extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
 pending = self.ctx.launch(
    self._merged["algorithm"], kernel_fn, tensor,
    *kernel_args_tuple, *extra_args, _defer_wait=True,
 )
 ```
 So the final positional arguments of the kernel are `(tensor_ptr,
 *kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`, and
 `tl=...` is injected automatically as a keyword on top of that. The length and
 order of the tuple returned by `kernel_args` must **match the kernel signature
 one-to-one**.
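The one-to-one requirement can be checked mechanically before launch. The sketch below is illustrative: `demo_kernel`, `build_positional_args`, and `arity_matches` are hypothetical helpers, and only `kernel_args` and the argument order are taken from the text above. It uses the standard `inspect` module to count the positional parameters of a kernel.

```python
import inspect


def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)


def demo_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    # Stand-in body: echo what was received so the binding can be inspected.
    return {"t_ptr": t_ptr, "n_elem": n_elem, "n_sips": n_sips, "sip_rank": sip_rank}


def build_positional_args(tensor_ptr: int, kernel_args_tuple: tuple, extra_args: tuple) -> tuple:
    """Compose (tensor_ptr, *kernel_args_tuple, *extra_args) as described in D2."""
    return (tensor_ptr, *kernel_args_tuple, *extra_args)


def arity_matches(kernel_fn, positional: tuple) -> bool:
    """True when the kernel declares exactly len(positional) positional parameters."""
    kinds = (inspect.Parameter.POSITIONAL_ONLY, inspect.Parameter.POSITIONAL_OR_KEYWORD)
    params = inspect.signature(kernel_fn).parameters.values()
    return sum(p.kind in kinds for p in params) == len(positional)


# sip_rank=1, sip_topo_kind=0, sip_topo_w=0, sip_topo_h=0 (illustrative values)
args = build_positional_args(0x1000, kernel_args(2, 8), (1, 0, 0, 0))
result = demo_kernel(*args, tl=object())
```

A check like `arity_matches` is only a guard against length mismatches; it cannot detect two same-typed arguments being swapped, so the ordering convention still has to be followed by hand.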
 ### D3. `kernel` function signature: a standardized 9 + tl arguments
 Recommended signature:
 ```python
 def my_kernel(
     t_ptr: int,             # VA base of the row-wise-sharded tensor on this SIP
     n_elem: int,            # element count per cube tile (or per shard)
     cube_w: int,            # cube mesh width  (from kernel_args)
     cube_h: int,            # cube mesh height (from kernel_args)
     n_sips: int,            # same as world_size (rank = SIP, ADR-0024)
     sip_rank: int,          # rank of this SIP
     sip_topo_kind: int,     # result of the TOPO_NAME_TO_KIND lookup
     sip_topo_w: int,        # SIP mesh width (0 for ring_1d)
     sip_topo_h: int,        # SIP mesh height (0 for ring_1d)
     *, tl,                  # TLContext (auto-injected)
 ) -> None:
     ...
 ```
 Even if `kernel_args` adopts a different positional order, the **last four
 positional arguments of the kernel are always `(sip_rank, sip_topo_kind,
 sip_topo_w, sip_topo_h)`**, which the backend appends as `extra_args`
 (ADR-0047 D5). User-defined algorithms must accept these four arguments too,
 but a single-SIP algorithm can simply ignore them.
 `tl` is injected as a keyword, not as a positional argument:
 `RuntimeContext.launch` adds `tl=tl_ctx` right before calling the kernel. The
 `tl` in the kernel signature must therefore be keyword-only (`*, tl`) or the
 last keyword-capable parameter.
 ### D4. Freedom and constraints of the kernel body
 Surface usable inside the kernel body: every `tl.*` primitive of ADR-0046 D3.
 Frequently used patterns in particular:
 - `cube_id = tl.program_id(axis=1)`: index of the cube this PE belongs to.
 - `pe_addr = t_ptr + cube_id * nbytes`: VA of the per-cube tile.
 - `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")`: load local data.
 - `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)`: IPCQ collective.
 - `acc = acc + recv`: TensorHandle arithmetic operators (ADR-0046 D4).
 - `tl.store(pe_addr, acc)`: store the result.
 The kernel body is an ordinary Python function and may branch and loop freely,
 with these restrictions:
 - No SimPy `yield` or `async` (ADR-0046 D1).
 - No direct access to `.data` of a TensorHandle: the phase 1 timing model knows
  nothing about data dependencies (the 2-pass split of ADR-0020).
 - Kernel execution must be deterministic: running twice with the same input
  must emit the same op sequence. No randomness or external IO.
 ### D5. SIP topology semantics: the meaning of `sip_topo_kind`
 The backend looks up the `system.sips.topology` string of `topology.yaml` in
 the algorithm module's `TOPO_NAME_TO_KIND` and converts it to the integer
 `sip_topo_kind`. The algorithm branches on this integer:
 ```python
 if sip_topo_kind == SIP_TOPO_RING:
    acc = _inter_sip_ring(...)
 elif sip_topo_kind == SIP_TOPO_TORUS:
    acc = _inter_sip_torus_2d(...)
 elif sip_topo_kind == SIP_TOPO_MESH:
    acc = _inter_sip_mesh_2d(...)
 ```
 Each topology branch communicates with its peers through IPCQ direction names
 (e.g. `"global_E"`, `"W"`, `"S"`, `"N"`). The meaning of a direction is defined
 by ADR-0023/0025, and `configure_sfr_intercube_multisip` installs the IPCQ
 neighbor table accordingly.
 When an algorithm module receives a topology kind it does not support, it is
 recommended to `raise ValueError(f"unsupported topology kind {sip_topo_kind}")`
 explicitly rather than fall through to a silent no-op, so that a mistaken
 dispatch to the backend fails fast.
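The name-to-kind lookup and the fail-loud branch can be combined in a short sketch. `resolve_topo_kind` and `select_inter_sip_phase` are hypothetical helpers (the real branches call `_inter_sip_ring` and friends); the constants, the dict, the fallback to 0 when the module has no map, and the `ValueError` message follow the text above. What the backend does with a topology name that is absent from an existing map is not stated in the document, so raising here is an assumption.

```python
from typing import Mapping, Optional

SIP_TOPO_RING = 0
SIP_TOPO_TORUS = 1
SIP_TOPO_MESH = 2

TOPO_NAME_TO_KIND = {
    "ring_1d": SIP_TOPO_RING,
    "torus_2d": SIP_TOPO_TORUS,
    "mesh_2d_no_wrap": SIP_TOPO_MESH,
}


def resolve_topo_kind(topo_name: str, topo_map: Optional[Mapping[str, int]]) -> int:
    """Backend-side lookup: fall back to kind 0 when the module exposes no map."""
    if topo_map is None:
        return 0
    if topo_name not in topo_map:
        # Assumption: an unknown name is treated as a configuration error.
        raise ValueError(f"unsupported topology name {topo_name!r}")
    return topo_map[topo_name]


def select_inter_sip_phase(sip_topo_kind: int) -> str:
    """Algorithm-side dispatch: return the branch label, or fail loudly."""
    if sip_topo_kind == SIP_TOPO_RING:
        return "ring"
    if sip_topo_kind == SIP_TOPO_TORUS:
        return "torus_2d"
    if sip_topo_kind == SIP_TOPO_MESH:
        return "mesh_2d"
    raise ValueError(f"unsupported topology kind {sip_topo_kind}")
```

Raising in the final branch means a kind added to the dict without a matching branch is caught on the first launch instead of producing an unreduced tensor.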
 ### D6. Structure of the algorithm entry in ccl.yaml
 An algorithm module is paired with an entry in `ccl.yaml` (ADR-0023 D10 +
 ADR-0047 D3):
 ```yaml
 defaults:
  algorithm: lrab_hierarchical_allreduce
  n_elem: 8
 algorithms:
  lrab_hierarchical_allreduce:
    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    # optional: world_size override
    # optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
 ```
 - `module`: full Python module path. The backend's `importlib.import_module`
  uses this string as-is.
 - `world_size` (optional): when given, overrides the topology fallback
  (ADR-0047 D2).
 - Algorithm-specific parameters are consumed by
  `configure_sfr_intercube_multisip`.
 To add a new algorithm:
 1. Write `src/kernbench/ccl/algorithms/<name>.py` (following the D1 convention).
 2. Add an entry to the `algorithms` section of `ccl.yaml`.
 3. (If needed) add an SFR install branch to `kernbench.ccl.sfr_config`.
 4. Add tests (e.g. `tests/sccl/test_<name>.py`, extending the eval harness of
   ADR-0043).
 ### D7. Legacy "rank = flat PE index" mode
 The `world_size` override path of `ccl.yaml` described in ADR-0047 D2 is used
 by legacy "rank = flat PE index" tests. In this mode too, an algorithm module
 may assume that `n_sips=world_size` ranks come in. The backend separates the
 rank↔(SIP, cube, PE) mapping beforehand, so the algorithm body needs no
 mode-specific branching.
 However, for single-cube workloads `cube_w=cube_h=1` is passed in, and the
 algorithm must be written so that mesh-based phases are skipped (see the
 `single_cube = (cube_w == 1 and cube_h == 1)` pattern in
 `lrab_hierarchical_allreduce.py`).
 ## Alternatives Considered
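 ### A1. Structure the algorithm module as a class (`class Allreduce: kernel(...)` etc.)
 Rejected. The Python module namespace itself already serves as the identity of
 the algorithm (the `importlib.import_module` of ADR-0047 D3); a class layer
 only adds indirection and thickens the dispatch code. The pattern of
 module-level free functions plus a `kernel` alias is clear enough.
 ### A2. Type `kernel_args` as an explicit dataclass
 Rejected (for now). It is normal for the number of arguments to differ between
 algorithms, and forcing a single dataclass would make compatibility across
 different algorithms harder. Returning a tuple is simple and meshes cleanly
 with the backend's `*kernel_args_tuple` unpacking. If an algorithm needs
 stronger typing of its own, it is free to use a NamedTuple inside its module.
 ### A3. Move SFR installation into the algorithm module
 Rejected. SFR installation (`configure_sfr_intercube_multisip`) is a
 cross-module decision that looks at both topology and algorithm to install the
 IPCQ neighbor table, so a dedicated location such as `kernbench.ccl.sfr_config`
 is more natural than the inside of an algorithm module. The "add an sfr_config
 branch if needed" workflow of D6 gives a clearer separation of responsibilities.
 ### A4. Auto-register the algorithm name in the module namespace (decorator)
 Rejected. Unlike ADR-0045 (bench), an algorithm is bound directly to its
 ccl.yaml entry, so an extra registry would be redundant. A single `module`
 string mapping is enough.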
 ## Consequences
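 - The one-line contract of ADR-0047 D3 is expanded into the author-friendly
  guide of D1–D7, so the signature no longer has to be inferred by grep when
  adding a new algorithm.
 - The 9 + tl argument signature of D3 is standardized and meshes naturally
  with the backend's `extra_args` append (ADR-0047 D5). It is stated that even
  future single-SIP-only algorithms must accept the four sip_* arguments.
 - With the fail-loud recommendation of D5, if the topology is misconfigured to
  one the algorithm does not support, the run fails with a ValueError raised by
  the algorithm instead of producing a silent wrong result.
 - The step-by-step procedure of D6 is explicit, so it is clear how far a new
  algorithm has to reach into sfr_config / tests / ccl.yaml.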
@@ -0,0 +1,267 @@
 # ADR-0051: Routing Helper API — `AddressResolver` + `PathRouter`
 ## Status
 Accepted (2026-05-22).
 This ADR specifies every public API, argument, and return value of the two
 helper classes exposed by `policy/routing/router.py` (`AddressResolver`,
 `PathRouter`), and where each of the four different adjacency graphs is used.
 ADR-0002 defines routing distance, ordering, and bypass rules, but the
 **helper API surface itself** has never been written up at the ADR level.
 ## First action
 ### `AddressResolver(graph)`
 On construction it caches two things:
 1. `self._node_ids = set(graph.nodes)`: the set of all node ids (for lookup).
 2. `self._hbm_slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube`:
   derived from `graph.spec.cube.memory_map` (default `48 GB / 8 slices = 6
   GB`). `resolve()` uses this value to recover `pe_id` from the `hbm_offset`
   of an HBM PA.
 In other words, **the first action of AddressResolver is to precompute the set
 of all node ids and the HBM slice size**. It does not keep the graph itself.
 ### `PathRouter(graph)`
 On construction it **builds four separate adjacency graphs at once**:
 1. `self._adj_all`: includes every edge (for component-to-component routing).
 2. `self._adj`: only edges with `kind != "command"` (PE DMA / general data path).
 3. `self._adj_mcpu_dma`: excludes `_MCPU_DMA_EXCLUDE = {"pe_internal",
   "pe_to_router"}` (so that M_CPU DMA is not mis-routed through PE pipeline
   nodes).
 4. `self._adj_local`: excludes the 8 `_UCIE_KINDS` (for cube-local routing; it
   prevents Dijkstra from preferring UCIe over the mesh because UCIe looks like
   a zero-distance bus).
 Each graph is a `defaultdict(list)` of `(neighbor, weight)` pairs, and
 `edge.routing_weight_mm or edge.distance_mm` is used as the weight.
 In other words, **the first action of PathRouter is to classify the topology
 edges under four different policies at once and build four adjacency lists**.
 Each `find_*()` call picks the appropriate graph and runs Dijkstra on it.
 ## Context
 `policy/routing/router.py` carries two responsibilities together:
 - **Name mapping**: it is the single owner of the topology naming convention
  (`sip{S}.cube{C}.<comp>`, `sip{S}.io{I}.pcie_ep`, etc.). Components, probe,
  IPCQ install, and the runtime API call the helpers instead of building name
  strings themselves.
 - **Path decisions**: policy separation by edge `kind`. Even for the same
  src→dst, a different adjacency must be used depending on the routing intent
  (PE DMA vs M_CPU DMA vs general component routing), and the result differs
  accordingly.
 Although this helper API is consumed widely across the code (probe.py /
 distributed.py / install.py / various components / tests), its **exact
 signatures, return semantics, and choice of adjacency** are not collected in
 one place at the ADR level. This ADR fills that gap.
 ## Decision
 ### D1. The 5 public APIs of `AddressResolver`
 #### D1.1. `resolve(addr: PhysAddr) -> str`
 Converts a `PhysAddr` instance to the destination node id in the topology.
 ```
 addr.kind == "hbm"             → f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
  where pe_id = addr.hbm_offset // self._hbm_slice_bytes  (ADR-0017 D4/D9)
 addr.kind == "pe_resource":
  addr.unit_type == PE         → f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
  addr.unit_type == SRAM       → f"sip{s}.cube{d}.sram"
  addr.unit_type == MCPU       → f"sip{s}.cube{d}.m_cpu"
  otherwise                    → RoutingError("unsupported unit_type")
 any other kind                 → RoutingError("unsupported address kind")
 ```
 If the computed node id is not in `self._node_ids`, it raises
 `RoutingError(f"node {node_id} not found in topology")`. That is, even when the
 syntax of the address is valid, it fails loudly if no node in the topology
 actually maps to it.
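The HBM branch of `resolve()` can be illustrated with a standalone function. `resolve_hbm_node` is a hypothetical helper that takes plain integers instead of a `PhysAddr`; the node-id format, the `pe_id = hbm_offset // hbm_slice_bytes` rule, the 48 GB / 8 slice default, and the not-found error follow the text above.

```python
from typing import Set


class RoutingError(Exception):
    """Raised when an address cannot be mapped to a topology node."""


# Default from the text: 48 GB of HBM per cube split into 8 slices.
HBM_SLICE_BYTES = 48 * (1 << 30) // 8


def resolve_hbm_node(
    sip: int,
    cube: int,
    hbm_offset: int,
    node_ids: Set[str],
    hbm_slice_bytes: int = HBM_SLICE_BYTES,
) -> str:
    """Map a cube-level HBM offset to the hbm_ctrl port of the owning PE."""
    pe_id = hbm_offset // hbm_slice_bytes
    node_id = f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"
    if node_id not in node_ids:
        # A syntactically valid address that maps to no node fails loudly.
        raise RoutingError(f"node {node_id} not found in topology")
    return node_id


# Toy topology: one cube with eight hbm_ctrl ports.
nodes = {f"sip0.cube2.hbm_ctrl.pe{p}" for p in range(8)}
target = resolve_hbm_node(0, 2, 3 * HBM_SLICE_BYTES + 100, nodes)
```

An offset at or beyond eight slices produces `pe8`, which is absent from the toy node set, so the same membership check also catches out-of-range offsets.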
 #### D1.2. `find_m_cpu(sip, cube) -> str`
 Returns `f"sip{sip}.cube{cube}.m_cpu"`. Raises `RoutingError` if absent.
 #### D1.3. `find_pcie_ep(sip, io_id="io0") -> str`
 Returns `f"sip{sip}.{io_id}.pcie_ep"`. Raises `RoutingError` if absent.
 #### D1.4. `find_io_cpu(sip, io_id="io0") -> str`
 Returns `f"sip{sip}.{io_id}.io_cpu"`. Raises `RoutingError` if absent.
 #### D1.5. `find_all_pcie_eps() -> list[str]`
 Returns the PCIE_EP node ids of all SIPs as a sorted list, filtered with
 `endswith(".pcie_ep")`. Used when cross-SIP IPCQ enumerates every PCIE_EP.
 This class is the single owner of the naming convention
 (`sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`) (ADR-0015 D4). The topology
 builder creates nodes with the same convention, and components never build
 name strings directly; everything goes through the helpers.
 ### D2. The 4 adjacency graphs of `PathRouter`
 The constructor builds all of them in one pass. The edge `kind` decides the policy:
 | graph             | excluded edge kinds                           | purpose                                    |
 |-------------------|-----------------------------------------------|--------------------------------------------|
 | `_adj_all`        | (none)                                        | includes M_CPU↔NOC command edges; IO_CPU/M_CPU routing |
 | `_adj`            | `"command"`                                   | PE DMA / general data path                 |
 | `_adj_mcpu_dma`   | `"pe_internal"`, `"pe_to_router"`            | M_CPU DMA (bypasses the PE pipeline)       |
 | `_adj_local`      | `_UCIE_KINDS` (`ucie_internal`, `ucie_conn_to_router`, `router_to_ucie_conn`, `ucie_conn_to_noc`, `noc_to_ucie_conn`, `ucie_mesh`, `io_to_cube`, `cube_to_io`) | same-cube routing (bypasses the UCIe bus) |
 Each graph is a `dict[node_id, list[(neighbor, weight)]]`, and the weight is
 `edge.routing_weight_mm or edge.distance_mm`. Splitting out `_adj` makes the
 routing influence of command edges explicit, and splitting out `_adj_local`
 keeps UCIe's "0-distance bus" from being preferred over the mesh, which is
 consistent with the cross-PE-slice mesh-distance requirement of ADR-0017 D7.
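A minimal sketch of the one-pass classification described by the table, under stated assumptions: the `Edge` dataclass is a hypothetical stand-in for the topology edge type, edges are treated as directed, and the exclusion sets are copied from the table rather than from `router.py`.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Exclusion sets copied from the D2 table.
UCIE_KINDS = frozenset({
    "ucie_internal", "ucie_conn_to_router", "router_to_ucie_conn",
    "ucie_conn_to_noc", "noc_to_ucie_conn", "ucie_mesh",
    "io_to_cube", "cube_to_io",
})
MCPU_DMA_EXCLUDE = frozenset({"pe_internal", "pe_to_router"})


@dataclass
class Edge:  # hypothetical stand-in for the topology edge type
    src: str
    dst: str
    kind: str
    distance_mm: float
    routing_weight_mm: Optional[float] = None


def build_adjacency(edges):
    """Classify every edge once and fill the four adjacency lists."""
    adj_all, adj = defaultdict(list), defaultdict(list)
    adj_mcpu_dma, adj_local = defaultdict(list), defaultdict(list)
    for e in edges:
        # `or` falls back to distance_mm when routing_weight_mm is None or 0.
        weight = e.routing_weight_mm or e.distance_mm
        entry = (e.dst, weight)
        adj_all[e.src].append(entry)
        if e.kind != "command":
            adj[e.src].append(entry)
        if e.kind not in MCPU_DMA_EXCLUDE:
            adj_mcpu_dma[e.src].append(entry)
        if e.kind not in UCIE_KINDS:
            adj_local[e.src].append(entry)
    return adj_all, adj, adj_mcpu_dma, adj_local
```

Because each edge is appended to every graph whose policy admits it, choosing a policy at query time is just choosing one of the four dicts.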
 ### D3. The 5 public APIs of `PathRouter` (+ 2 backward-compat)
 #### D3.1. `find_path(src_pe: str, dst_node: str) -> list[str]`
 **PE DMA routing**. `src_pe` is a PE prefix (e.g. `"sip0.cube0.pe0"`); the
 function automatically appends `.pe_dma`, so the actual start node becomes
 `"sip0.cube0.pe0.pe_dma"`.
 The adjacency is chosen by whether the route is cube-local (`_same_cube`):
 - **same-cube** (src and dst share the `sip{S}.cube{C}.` prefix):
  uses `_adj_local`. Blocking the UCIe shortcut makes cross-PE-slice traffic
  pay the true mesh distance (ADR-0017 D7).
 - **cross-cube**: uses `_adj`. UCIe is naturally included as the best
  option for the cross-cube path.
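The same-cube test and the graph choice can be sketched like this. It is an illustration only: the regular expression and the function names `cube_prefix`, `same_cube`, and `pick_graph_name` are assumptions, not the actual `_same_cube` implementation.

```python
import re

# Matches the `sip{S}.cube{C}.` prefix of a cube-internal node id.
_CUBE_PREFIX = re.compile(r"^(sip\d+\.cube\d+)\.")


def cube_prefix(node_id):
    """Return 'sip{S}.cube{C}' for cube-internal nodes, else None."""
    match = _CUBE_PREFIX.match(node_id)
    return match.group(1) if match else None


def same_cube(src, dst):
    a, b = cube_prefix(src), cube_prefix(dst)
    return a is not None and a == b


def pick_graph_name(src_pe, dst_node):
    """Return the real start node and the name of the adjacency to search."""
    start = f"{src_pe}.pe_dma"  # the PE prefix gets `.pe_dma` appended
    return start, ("_adj_local" if same_cube(start, dst_node) else "_adj")
```

Matching on the full `sip{S}.cube{C}.` prefix, including the trailing dot, avoids confusing `cube1` with `cube10`, which a plain `startswith("sip0.cube1")` would not.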
 #### D3.2. `find_path_with_distance(src_pe, dst_node) -> tuple[list[str], float]`
 Uses the same adjacency policy as D3.1 but returns `(path, total_distance)`
 together. Used by probes and analysis tools that need the distance metric.
 #### D3.3. `find_mcpu_dma_path(m_cpu_id: str, dst_hbm_id: str) -> list[str]`
 **M_CPU DMA path**. Uses `_adj_local` when the cube is the same (the route
 stays inside the mesh) and `_adj_all` when it differs (via UCIe).
 `_MCPU_DMA_EXCLUDE` automatically excludes PE pipeline nodes, so the router
 never produces a wrong path in which the M_CPU is routed through a PE's
 internal stages.
 #### D3.4. `find_memory_path(src: str, dst: str) -> list[str]`
 A direct memory path such as `pcie_ep → io_noc → cube → router mesh →
 hbm_ctrl`. Uses `_adj_mcpu_dma`, which excludes `pe_internal` and
 `pe_to_router` edges, to guarantee that host-issued reads/writes do not leak
 into the PE pipeline. Called directly by the probe (the H2D/D2H case of ADR-0049 D1).
 #### D3.5. `find_node_path(src: str, dst: str) -> list[str]`
 A path between any two nodes. **Includes command edges** (uses `_adj_all`).
 Called by IoCpuComponent, MCpuComponent, and others that must traverse
 command-kind links such as M_CPU ↔ NOC.
 #### D3.6. backward-compat shims
 - `_dijkstra(start, goal) -> list[str]`: a thin wrapper around
  `_run_dijkstra(self._adj, …)`.
 - `_dijkstra_with_dist(start, goal) -> tuple[list[str], float]`: the variant
  that also returns the distance.
 As the underscore prefix suggests, these pose as internal APIs, but existing
 tests call them directly. New code should use D3.1 through D3.5; the two
 shims are deprecation candidates.
 ### D4. Dijkstra algorithm: single-source shortest path
 `_run_dijkstra_with_dist(adj, start, goal)`:
 - `heapq` priority queue.
 - `best: dict[node, distance]`: cache of the shortest known distance per node.
 - `prev: dict[node, predecessor]`: used for path reconstruction.
 - The weight is `routing_weight_mm or distance_mm`. Some edges, such as UCIe,
  specify a routing_weight that differs from their distance, so keeping the
  weight separate is intentional.
 `start == goal` takes the fast path and returns `([start], 0.0)`. An
 unreachable goal raises `RoutingError(f"no path from {start} to {goal}")`.
 The algorithm is **deterministic**: the same graph + start/goal always yields
 the same path. This matches the "Routing MUST be deterministic" requirement
 of SPEC R1. Tie-breaking follows the `heapq` push order (Python list order is
 deterministic).
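A compact sketch of the search described above, with the same `best` / `prev` bookkeeping. It is an assumption-laden illustration rather than the body of `_run_dijkstra_with_dist`: `RoutingError` is redefined locally, and heap entries are `(distance, node_id)` tuples, so in this sketch equal-distance entries are ordered by node id.

```python
import heapq


class RoutingError(Exception):
    """Stand-in for the router's error type."""


def run_dijkstra_with_dist(adj, start, goal):
    """Return (path, total_distance) over an adjacency dict of (neighbor, weight)."""
    if start == goal:
        return [start], 0.0  # fast path
    best = {start: 0.0}      # shortest known distance per node
    prev = {}                # predecessor map for path reconstruction
    heap = [(0.0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry; a shorter route was found already
        for neighbor, weight in adj.get(node, ()):
            cand = dist + weight
            # Strict '<' keeps the first-found route among equal-cost ones.
            if cand < best.get(neighbor, float("inf")):
                best[neighbor] = cand
                prev[neighbor] = node
                heapq.heappush(heap, (cand, neighbor))
    if goal not in best:
        raise RoutingError(f"no path from {start} to {goal}")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path, best[goal]
```

Nothing in the loop depends on hash ordering or wall-clock state, which is what makes repeated queries on the same graph return the same path.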
 ### D5. Single-owner principle for the helper API
 The following are decided only inside router.py:
 - Naming convention: `sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`,
  `sip{S}.cube{C}.hbm_ctrl.pe{pe_id}`.
 - Adjacency policy: which edge kinds belong to which graph.
 - How the PE id is recovered from the HBM slice size.
 - The Dijkstra weight (`routing_weight_mm or distance_mm`).
 If this single-owner principle breaks (e.g. a component starts building
 `f"sip{s}..."` on its own), the blast radius of any naming change explodes.
 This aligns with the spirit of ADR-0015 D4.
 ### D6. Consumers of the helper API
 The call sites of the methods exposed by this helper (as of the current corpus):
 - `probes/probe.py` (ADR-0049): `find_pcie_ep`, `find_io_cpu`,
  `find_m_cpu`, `find_node_path`, `find_mcpu_dma_path`,
  `find_memory_path`, `find_path`, `resolve`.
 - `runtime_api/distributed.py` (ADR-0047): indirect (routing inside the engine).
 - `ccl/install.py` (ADR-0023): `find_all_pcie_eps`, `resolve`.
 - `sim_engine/event_log.py`: like the probe, `find_pcie_ep` and
  `find_memory_path`.
 - `components/builtin/m_cpu.py`, `components/builtin/io_cpu.py`:
  `find_node_path`, `find_mcpu_dma_path`.
 - Various tests (test_routing.py, test_cross_sip_routing.py, etc.): most of
  D3.1 through D3.5.
 When a new consumer is added, D1/D3 of this ADR are the first reference for
 judging whether a method matching its intent already exists or a new method
 must be added.
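 ## Alternatives Considered
 ### A1. A single adjacency graph with a dynamically applied edge-kind filter
 Rejected. Re-filtering the graph on every `find_*()` call hurts Dijkstra's
 cache locality and performance. Building the 4 graphs up front (D2) costs
 little memory (edges number in the tens of thousands at most), and policy
 selection at call time is O(1).
 ### A2. Separating adjacency by dedicated metadata instead of the edge `kind`
 Rejected. The edge `kind` is already assigned by the topology builder
 (ADR-0015 D4 + ADR-0017); introducing separate metadata would create a
 duplicate that the two systems must keep in sync.
 ### A3. BFS + uniform weight instead of Dijkstra
 Rejected. Since routing_weight_mm differs per edge (mesh link / UCIe /
 IO-internal), BFS only minimizes hop count and does not guarantee the
 shortest latency / distance. That violates the deterministic and accurate
 routing requirements of SPEC R1 + R2.
 ### A4. Helper API as module functions instead of class methods
 Rejected. The two classes (`AddressResolver`, `PathRouter`) each hold cached
 state (`_node_ids`, `_hbm_slice_bytes`, the 4 adjacency graphs), and many
 routing queries run against the same graph instance. Module functions would
 have to rebuild that state on every call or keep it global, which degrades
 safety and performance.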
 ## Consequences
 - As long as components, probes, IPCQ install, and the runtime API call only
  the helpers in router.py, a naming change (e.g. `.io0.` → `.iochiplet0.`)
  is finished by editing a single file (D5).
 - The 4-graph split of D2 is fixed in an ADR, so when a new edge kind is
  added (e.g. a new kind for an inter-die UCIe link) there is a clear
  criterion for deciding which graphs include it.
 - The cube-local vs cross-cube branch of D3.1 (ADR-0017 D7) is stated
  explicitly, so anyone changing routing behavior knows which adjacency to touch.
 - The consumer list of D6 is stated explicitly, so the PR review scope of a
  helper API change is clear. The backward-compat shims (D3.6) are identified
  as deprecation candidates.
@@ -0,0 +1,352 @@
 # ADR-0052: OpLog + MemoryStore Schemas — sim_engine internals
 ## Status
 Accepted (2026-05-22).
 This ADR specifies the `OpRecord` schema in `sim_engine/op_log.py`, the
 record_start / record_end / record_copy behavior of `OpLogger`, and the
 (space, addr) address-space namespace and read/write semantics used by
 `MemoryStore` in `sim_engine/memory_store.py`. ADR-0020 (2-pass data
 execution) declares that both pieces of infrastructure exist, but **the exact
 fields of a record and their meaning** were never written down at ADR level,
 even though several ADRs, including ADR-0046 D3.2 (`tl.store` visibility) and
 ADR-0023 D9 (IPCQ copy record), depend on their behavior.
 ## First action
 ### `OpLogger(memory_store=None)`
 On construction, the following 3 fields are initialized:
 1. `self._records: list[OpRecord] = []`: the accumulated op records.
 2. `self._pending: dict[int, dict] = {}`: partial records keyed by `id(msg)`
   (created at record_start, with record_end still outstanding).
 3. `self._memory_store = memory_store`: optional MemoryStore reference, used
   to capture input snapshots for math ops and HBM source snapshots for dma_write.
 At construction time both records and pending are empty; `record_*` calls
 accumulate data sequentially.
 ### `MemoryStore()`
 On construction a single field is initialized:
 `self._storage: dict[str, dict[int, np.ndarray]] = {}`. It is a two-level dict
 (`space → addr → ndarray`), and an inner dict is created lazily whenever a
 new space is needed.
 In short, **the first thing both pieces of infrastructure do is "create an
 empty accumulation buffer plus a per-space sparse dict"**. The fields start
 filling only when the first record / write actually arrives.
 ## Context
 D2/D5/D7 of ADR-0020 (2-pass data execution) declare the following:
 - During Phase 1 (timing), the `ComponentBase._on_process_start/end` hooks
  call `OpLogger.record_start/end` to record the timing + metadata of every
  data op.
 - Phase 2 (data) replays the op_log in t_start order to compute the actual
  data results.
 - The data payload itself is kept in `MemoryStore` under a (space, addr) key.
 Later ADRs such as ADR-0023 D9 (IPCQ atomic write), ADR-0027 (avoiding
 Megatron TP scratch overwrite), and ADR-0046 D3.2 (`tl.store` visibility)
 depend on the behavior of op_log and MemoryStore, yet the **exact record
 fields / space names / snapshot timing** can only be found by grepping the
 code. This ADR writes them down.
 ## Decision
 ### D1. `OpRecord` schema: 7 fields
 ```python
 from dataclasses import dataclass, field
 from typing import Any

 @dataclass
 class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str               # "memory" | "gemm" | "math" | "unknown"
    op_name: str               # e.g. "dma_read", "gemm_f16", "exp",
                               #     "TileToken/DMA_READ", "composite_gemm",
                               #     "ipcq_copy"
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)
 ```
 - **`t_start` / `t_end`**: SimPy time (float ns). `t_start` is when the
  component started the op, `t_end` is when it completed. duration = `t_end - t_start`.
 - **`component_id`**: the node id where the op occurred (e.g.
  `"sip0.cube0.pe0.pe_dma"`).
 - **`op_kind`**: one of 4 values. The Phase 2 DataExecutor branches on it.
 - **`op_name`**: a human-friendly name for debugging / analysis. For a
  TileToken it is `"TileToken/{stage_type}"` (e.g. `"TileToken/DMA_READ"`),
  which distinguishes the stage.
 - **`params`**: op-dependent metadata dict (see D3).
 - **`dependency_ids`**: currently unused (default `[]`). Reserved for when
  cross-op dependency tracking becomes necessary.
 ### D2. `OpLogger.records`: guaranteed t_start ordering
 ```python
 @property
 def records(self) -> list[OpRecord]:
    self._records.sort(key=lambda r: r.t_start)
    return self._records
 ```
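 Every access performs a stable sort by `t_start`, so records with the same
 t_start keep their insertion order. This matches the "t_start stable
 ordering" requirement of ADR-0020 D5.
 The Phase 2 DataExecutor always goes through the `records` property, so even
 when record_end calls arrive in an order different from t_start (e.g. a short
 op starts later than a long op but finishes first), it receives a re-sorted,
 consistent sequence.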
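The ordering guarantee rests on `list.sort` being stable in Python. A small sketch, using a trimmed hypothetical `Rec` in place of `OpRecord`, with records appended in completion order:

```python
from dataclasses import dataclass


@dataclass
class Rec:  # trimmed, hypothetical stand-in for OpRecord
    t_start: float
    op_name: str


# Appended in record_end (completion) order, not in t_start order.
records = [
    Rec(5.0, "long_op"),
    Rec(2.0, "a"),
    Rec(2.0, "b"),
    Rec(0.0, "first"),
]

# Same call as the `records` property: stable sort keyed on t_start only.
records.sort(key=lambda r: r.t_start)
names = [r.op_name for r in records]
```

After the sort, `"a"` still precedes `"b"` because both share `t_start == 2.0` and the sort preserves insertion order for equal keys.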
 ### D3. `params` schema per op_name (the `_extract_op_info` mapping)
 #### D3.1. `op_kind="memory", op_name="dma_read"` (DmaReadCmd)
 ```python
 {"src_addr": int, "nbytes": int, "handle_id": str}
 ```
 #### D3.2. `op_kind="memory", op_name="dma_write"` (DmaWriteCmd)
 ```python
 {
     "src_space": str,   # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
     "src_addr": int,    # handle.addr
     "shape": tuple, "dtype": str,
     "dst_space": "hbm", # DmaWrite always targets HBM
     "dst_addr": int,
     "nbytes": int,
     "handle_id": str,
     # at record_end, a snapshot is attached when src_space == "hbm" (D4)
     "snapshot": np.ndarray | None,
 }
 ```
 #### D3.3. `op_kind="gemm", op_name=f"gemm_{dtype_a}"` (GemmCmd)
 ```python
 {
    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
    "dtype_in": str, "dtype_out": str,
    "m": int, "k": int, "n": int,
     # ADR-0027: per-operand + output spaces are preserved
    "src_a_space": str, "src_b_space": str, "dst_space": str,
 }
 ```
 #### D3.4. `op_kind="math", op_name=msg.op` (MathCmd; op = "exp", "sum", "add", "where", etc.)
 ```python
 {
     "input_addrs": list[int],   # addr of each input handle
     "input_shapes": list[tuple],
     "input_spaces": list[str],
     "input_dtypes": list[str],
     "dst_addr": int, "dst_space": str,
     "shape_out": tuple, "dtype": str,
     "axis": int | None,         # meaningful only for reductions
     # at record_end, snapshots of all inputs are filled in (D4)
     "input_snapshots": list[np.ndarray | None],
 }
 ```
 #### D3.5. `op_kind="gemm" or "math", op_name=f"composite_{op}"` (CompositeCmd)
 ```python
 {
    "op": str,              # "gemm" | "math"
    "out_addr": int, "out_nbytes": int,
     # when op == "gemm", the same fields as GemmCmd are added:
    "src_a_addr": int, "src_b_addr": int,
    "shape_a": tuple, "shape_b": tuple,
    "dtype_in": str, "dtype_out": str,
    "src_a_space": str, "src_b_space": str,
    "dst_space": "hbm", "dst_addr": int,  # = out_addr
 }
 ```
 If `op == "gemm"` then `op_kind = "gemm"`, otherwise `"math"`. This aliasing
 lets Phase 2 replay the record through the same path as GemmCmd.
 #### D3.6. `op_kind="memory", op_name="ipcq_copy"` (record_copy-only path)
 ```python
 {
     "src_space": str, "src_addr": int,
     "dst_space": str, "dst_addr": int,
     "shape": tuple, "dtype": str, "nbytes": int,
     "snapshot": np.ndarray | None,   # passed by the caller; if absent, record_copy does a fresh read
 }
 ```
 `PE_DMA._handle_ipcq_inbound` (ADR-0023 D9) emits this record so that Phase 2
 can replay the inbound copy of an IPCQ slot. The record does not go through
 `record_start` / `record_end`; it is pushed directly with `record_copy()`.
 #### D3.7. `op_kind="unknown", op_name=type(msg).__name__`
 The fallback for messages that `_extract_op_info` does not recognize. params = `{}`.
 The DataExecutor skips this op_kind, so it has no effect on Phase 2 replay.
 ### D4. When snapshots are captured
 When `OpLogger._memory_store` is set, record_end does the following:
 - **math op**: calls `self._memory_store.read(...)` with every input's
  addr/shape/space/dtype and attaches ndarray copies to
  `params["input_snapshots"]`. A failed read yields None.
 - **dma_write op**: attaches a snapshot of the source HBM to
  `params["snapshot"]` only when `src_space == "hbm"`. TCM sources are
  **explicitly skipped**: TCM (PE scratch) is refilled by the Phase 2
  math/gemm replay, so a Phase-1-time snapshot risks capturing stale data from
  a previous kernel (ADR-0027 postmortem: TP gemm → all_reduce race).
 - **ipcq_copy**: the `record_copy` caller passes an in-flight snapshot such
  as `snapshot=token.data`. If none is given, record_copy falls back to a fresh read.
 Snapshots are taken with `.copy()` (`ndarray.copy()` is a fresh allocation),
 so they are safe from later storage mutation. This is the basis for avoiding
 the "cross-PE Phase 2 ordering" race of ADR-0027.
 When `memory_store` is None (Phase 1 timing-only mode), every snapshot step
 is skipped. Only the timing information of each record is kept, and data
 replay is not possible.
 ### D5. TileToken handling: record_start captures the stage information
 For the self-routing tile token of ADR-0014 D6 (pipeline mode), stage_idx may
 already have advanced by the time record_end runs (the TileToken caches the
 params of the next stage as it moves on to the next component). Therefore
 `record_start` stores the following in `pending[id(msg)]["snap"]` ahead of time:
 ```python
 snap["stage_type"] = stage.stage_type.name        # "DMA_READ", "GEMM", etc.
 snap["stage_params"] = dict(stage.params)         # copy of the params at this point
 ```
 `record_end` retrieves this snap and merges it into params:
 - `params["stage_type"]` is added to the final params.
 - The keys of `stage_params` are merged (existing keys are preserved).
 - If `op_name == "TileToken"`, it is rewritten to
  `op_name = f"TileToken/{stage_type}"` (e.g. `"TileToken/DMA_READ"`), which
  disambiguates records of different stages produced by the same component.
 Thanks to this mechanism, DMA_READ vs DMA_WRITE and FETCH vs STORE remain
 distinguishable on the reporting side even when they occur on the same
 component (e.g. pe_dma).
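The capture-then-merge flow can be sketched as below. Everything here is a simplified assumption for illustration: `MiniLogger`, the `StageType` enum, and the token shape (`stages`, `stage_idx`) are hypothetical, and records are stored as plain tuples instead of `OpRecord`.

```python
from enum import Enum
from types import SimpleNamespace


class StageType(Enum):  # hypothetical subset of stage types
    DMA_READ = 1
    GEMM = 2


class MiniLogger:
    def __init__(self):
        self.pending = {}
        self.records = []

    def record_start(self, msg, t):
        # Capture the stage now; stage_idx may have advanced by record_end.
        stage = msg.stages[msg.stage_idx]
        snap = {
            "stage_type": stage.stage_type.name,
            "stage_params": dict(stage.params),  # copy, not a reference
        }
        self.pending[id(msg)] = {"t_start": t, "snap": snap}

    def record_end(self, msg, t, op_name="TileToken", params=None):
        entry = self.pending.pop(id(msg))
        snap = entry["snap"]
        params = dict(params or {})
        params["stage_type"] = snap["stage_type"]
        for key, value in snap["stage_params"].items():
            params.setdefault(key, value)  # keys already present are preserved
        if op_name == "TileToken":
            op_name = f"TileToken/{snap['stage_type']}"
        self.records.append((entry["t_start"], t, op_name, params))
```

If the token's `stage_idx` is bumped between the two calls, the record still reports the stage that was active at `record_start`, which is the point of capturing early.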
 ### D6. `MemoryStore`: a two-level (space, addr) dict
 ```python
 class MemoryStore:
    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}
     def write(self, space, addr, data):
         # setdefault creates the inner dict lazily on the first write to a space
         self._storage.setdefault(space, {})[addr] = data
    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
    def has(self, space, addr) -> bool: ...
    def snapshot(self) -> MemoryStore: ...
 ```
 #### D6.1. space namespace
 String keys. Standard values:
 - `"hbm"`: HBM data (deploy_tensor + Phase 2 dma_write results).
 - `"tcm"`: PE-local TCM (Phase 2 math/gemm results).
 - `"sram"`: cube-level SRAM (ADR-0023 D9.7 IPCQ slot tier).
 Other spaces (e.g. `"reg"`) are freely allowed: `_storage` is a lazy dict, so
 a new space is created automatically on its first write call.
 #### D6.2. address keying
 `addr` is an integer. It may be a **physical address (PA) or a virtual
 address (VA)**; MemoryStore itself does not know the meaning of the address
 space and simply uses the value as a key. The Phase 1 `MemoryWriteMsg` writes
 both PA and VA (`_create_tensor` zero-inits under the PA and also under the VA
 base), and Phase 2 reads/writes with the addresses captured by the op_log.
 The meaning of `addr` is decided by the caller; `MemoryStore` only provides lookup.
 #### D6.3. read/write semantics: reference store (no copy)
 `write(space, addr, data)`: stores a reference to the `data` ndarray. It does
 **not copy**. If the caller later mutates the same ndarray, the stored value
 changes as well.
 `read(space, addr, shape=None, dtype=None)`: returns a reference to the
 stored ndarray. When `shape` or `dtype` is given:
 - `dtype != stored.dtype`: reinterpret cast via `arr.view(np_dtype)` (no
  copy).
 - `shape != stored.shape`: `arr.reshape(shape)` (view) if `nbytes` matches.
 - `nbytes` mismatch: `ValueError`.
 To isolate data safely, the caller must call `arr.copy()`. This is why the
 race avoidance of ADR-0027 enforces an explicit copy at the op_log snapshot step.
 #### D6.4. `has(space, addr) -> bool`
 Only checks whether the key exists. It does not instantiate any data.
 #### D6.5. `snapshot() -> MemoryStore`
 Shallow copy. It creates new inner dict instances but shares the ndarray
 references. Used at Phase 2 initialization to fork the Phase 1 store so that
 Phase 2 mutations of the dict structure do not affect other users of the
 Phase 1 store.
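The reference, view, and shallow-fork semantics can be illustrated with a small stand-in. `MiniStore` is a hypothetical sketch built only from the behavior described in D6.3 and D6.5, not a copy of `memory_store.py`, and it assumes NumPy is available.

```python
import numpy as np


class MiniStore:  # hypothetical stand-in for MemoryStore
    def __init__(self):
        self._storage = {}

    def write(self, space, addr, data):
        # Stores the reference; no copy is made.
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr, shape=None, dtype=None):
        arr = self._storage[space][addr]
        if dtype is not None and np.dtype(dtype) != arr.dtype:
            arr = arr.view(np.dtype(dtype))  # reinterpret cast, no copy
        if shape is not None and tuple(shape) != arr.shape:
            if int(np.prod(shape)) * arr.dtype.itemsize != arr.nbytes:
                raise ValueError("nbytes mismatch")
            arr = arr.reshape(shape)
        return arr

    def snapshot(self):
        # Shallow fork: new inner dicts, shared ndarray references.
        fork = MiniStore()
        fork._storage = {s: dict(inner) for s, inner in self._storage.items()}
        return fork


store = MiniStore()
buf = np.zeros(4, dtype=np.float32)
store.write("hbm", 0x1000, buf)
buf[0] = 1.0                       # mutation is visible through the store
fork = store.snapshot()
fork.write("hbm", 0x2000, np.ones(1))  # new key lands only in the fork
```

Note that a shallow fork isolates key insertions and replacements, not in-place edits of a shared ndarray; callers that need full isolation still have to `.copy()`.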
 ### D7. op_log assumes the SimPy single thread
 `_records` and `_pending` of `OpLogger` are used without locks. The
 assumption is that SimPy is single-threaded, so no other thread can
 interleave between `record_start` and `record_end`.
 If multi-process kernbench (ADR-0047 D6) is introduced later, OpLogger must be
 split per process. A single OpLogger instance cannot receive records from
 multiple processes.
 ## Alternatives Considered
 ### A1. Moving op_log to an external store such as SQLite / parquet
 Rejected (for now). An in-memory list minimizes the Phase 1 → Phase 2 handoff
 latency. Externalizing would make sense for long-running batch runs, but for
 the current single-run workloads it only adds overhead.
 ### A2. Capturing snapshots at record_start
 Rejected. At record_start the input is often not yet populated (e.g. when a
 math op's input is the output of the immediately preceding op). record_end is
 the correct point.
 ### A3. Splitting MemoryStore into per-component stores
 Rejected. The (space, addr) key already disambiguates sufficiently, and a
 per-component split would complicate the cross-PE IPCQ copy case (ADR-0023
 D9), which must access both the source and destination stores.
 ### A4. Recording explicit cross-op dependency edges in op_log
 Partially adopted. The `dependency_ids` field exists in OpRecord but is
 currently unused (D1). The Phase 2 DataExecutor decides ordering by t_start
 sort + a secondary sort (memory ops before math at the same t_start); if an
 explicit dependency graph becomes necessary, this field is where it would go.
 For now the ordering rule is sufficient, so the field stays unused.
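 ## Consequences
 - The op_log / MemoryStore declarations of ADR-0020 are expanded into the
  concrete schemas of D1 through D6, so the exact field semantics can be
  looked up in the ADR, without grep, when writing or modifying the Phase 2
  DataExecutor.
 - The per-op_name params schema of D3 is explicit, so it is clear where in
  the `_extract_op_info` branches a new op (e.g. a new reduction type) belongs.
 - The snapshot timing differences of D4 (math = input snapshot, dma_write =
  HBM-only snapshot) are fixed in an ADR, so the cross-PE race avoidance
  decided in ADR-0027 does not silently break in a future refactor.
 - The reference-store semantics of D6.3 are explicit, so callers are aware
  that mutation safety is their responsibility. This justifies the explicit
  `.copy()` pattern of ADR-0027.
 - The single-thread assumption of D7 is explicit, so it is clear that
  OpLogger must be split when multi-process kernbench (an ADR-0047 D6
  supersession candidate) is introduced.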
@@ -0,0 +1,307 @@
 # ADR-0053: Topology Builder + Visualizer Algorithms
 ## Status
 Accepted (2026-05-22).
 This ADR specifies the key algorithm choices (placement-driven router
 attachment, mesh auto-layout, source_hash cache, view projection, SVG
 rendering) of the topology compile and visualization pipeline carried out
 jointly by `topology/builder.py`, `topology/mesh_gen.py`, and
 `topology/visualizer.py`. ADR-0006 defines the high-level intent of topology
 compilation (compiled topology, distance extraction, automatic diagram
 generation), but **which algorithms the builder actually uses** could only be
 found by grepping the code.
 ## First action
 When `resolve_topology(path_str)` is called, the following 4 stages run in order:
 1. **Path validation** (`builder.py::resolve_topology`):
   `Path(path_str).expanduser().resolve()`, existence check, is-file check.
   Raises `FileNotFoundError` or `ValueError` on failure.
 2. **YAML parsing** (`_read_spec`): `yaml.safe_load`. A parse error raises a
   `ValueError` that includes line/column information. Non-dict input is rejected.
 3. **Mesh auto-generation** (`mesh_gen.ensure_mesh_file`): creates
   `cube_mesh.yaml` in the same directory as the topology yaml (when the cache
   is invalid) or reuses it (on a cache hit). This stage decides the router
   grid and attachment information of the cube NoC.
 4. **Graph compilation** (`_compile_graph`): accumulates nodes/edges in the
   order system → IO chiplets → cubes → inter-cube edges → IO↔cube edges →
   system↔IO edges, then generates 4 view projections (system, sip, cube, pe)
   and bundles everything into a `TopologyGraph`.
 In short, **the first thing topology compile does is "read topology.yaml into
 a dict, create/validate cube_mesh.yaml in the same directory, then build the
 flat graph + 4-view projection in the order system→sip→cube→pe"**.
 ## Context
 Responsibilities of the `topology/` package:
 - **builder.py** (1207 lines): takes topology.yaml and compiles a
  `TopologyGraph` (nodes + edges + 4 view projections).
 - **mesh_gen.py** (305 lines): automatically decides the router grid of the
  cube NoC and the attachment positions of PE/UCIe/M_CPU/SRAM, and caches the
  result as `cube_mesh.yaml`.
 - **visualizer.py** (887 lines): generates 4 kinds of SVG diagram (system /
  sip / cube / pe) from a `TopologyGraph`.
 ADR-0006 defines the high-level decision that "the result of topology
 compilation is the single source for distance metadata and diagram
 generation", but the concrete algorithms (e.g. placement-driven
 nearest-router attachment, computing the HBM exclusion zone, which fields of
 source_hash trigger invalidation) are not in any ADR.
 In particular, the following decisions are missing at ADR level:
 - Why is the mesh_gen output cached in a separate file (`cube_mesh.yaml`)?
 - Which fields does source_hash include, and which changes force regeneration?
 - Why are placement coordinates in mm rather than cube coordinates?
 - How are HBM zone exclusion and the UCIe N/S/E/W distribution decided inside the mesh?
 - How do the abstraction levels of the 4 view projections (system/sip/cube/pe) differ?
 This ADR collects these decisions in one place.
 ## Decision
 ### D1. Compile pipeline: 6 node/edge stages
 `_compile_graph(spec)`:
 1. **Create system nodes** (`_instantiate_system`): adds system-level nodes
   such as `fabric.switch0` and the host CPU.
 2. **per-SIP loop** (`for sip_id in range(system.sips.count)`):
   - **IO chiplets** (`_instantiate_io_chiplets`): creates pcie_ep / io_cpu /
     io_noc / io_ucie PHY / conn nodes + internal bidirectional edges.
   - **cube instantiation** (`_instantiate_cube`): based on the router grid in
     cube_mesh.yaml, instantiates per-cube routers, PE sub-components (pe_cpu,
     pe_dma, pe_fetch_store, pe_gemm, pe_math, pe_mmu, pe_tcm, pe_scheduler,
     pe_ipcq), m_cpu, sram, and hbm_ctrl, and lays down the internal edges.
   - **inter-cube edges** (`_add_inter_cube_edges`): UCIe N/S/E/W mesh
     edges.
   - **IO ↔ cube edges** (`_add_io_to_cube_edges`): connections between io_noc
     and the cube's edge UCIe PHY.
 3. **switch ↔ IO edges** (`_add_system_to_io_edges`): bidirectional edges
   between `fabric.switch0` and each SIP's `pcie_ep` (ADR-0038 D3 + the
   cross-SIP IPCQ path of ADR-0010).
 4. **Build the 4 view projections**:
   - `_build_system_view(spec)`: Tray level, the SIPs and the system switch.
   - `_build_sip_view(spec)`: the cube mesh + IO chiplet inside a SIP.
   - `_build_cube_view(spec)`: the router grid + PE/M_CPU/SRAM/HBM_CTRL
     attachments inside a single cube.
   - `_build_pe_view(spec)`: the 9 sub-components + internal edges inside a single PE.
 5. **Return the TopologyGraph**: `TopologyGraph(spec, nodes, edges, system_view,
   sip_view, cube_view, pe_view)`.
 The 6 node/edge construction stages (steps 1 through 3, counting the four
 sub-stages of the per-SIP loop separately) are **order-sensitive**:
 inter-cube edges only have valid src/dst once the cubes exist, and the IO
 chiplets must be created first so that IO ↔ cube edges can reference them.
 When inserting a new node type, check the dependencies and insert it at the
 appropriate position.
 ### D2. `cube_mesh.yaml`: separate file + source_hash cache
 `mesh_gen.ensure_mesh_file(cube_spec, mesh_path)`:
 1. Computes `source_hash = _compute_source_hash(cube_spec)`. Input fields:
   - `geometry` (cube_mm.w/h, etc.).
   - `pe_layout` (corners, pe_per_corner).
   - `ucie.n_connections`.
   - `memory_map.hbm_mapping_mode`.
   - `placement` (m_cpu/sram pos_mm).
 2. If `mesh_path` (= `cube_mesh.yaml` in the same directory as
   `topology.yaml`) exists and `existing.source_hash == source_hash`, it is
   reused (cache hit).
 3. Otherwise a new mesh is generated with
   `_generate_mesh(cube_spec, source_hash)` and saved as yaml.
 Reasons for caching in a separate file:
 - Mesh generation involves PE/UCIe/router attachment computation and is
  heavy to redo every time.
 - Repeated runs with the same cube spec must be guaranteed the same mesh.
 - It becomes an artifact that a person can inspect / debug directly.
 The 5 fields listed for `source_hash` are what determines the mesh shape;
 other changes (e.g. bandwidth, overhead_ns) do not trigger mesh regeneration.
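A sketch of a field-scoped hash with the invalidation behavior described above. The choice of SHA-256 over canonical JSON and the helper names `_get`, `compute_source_hash`, and `needs_regeneration` are assumptions for illustration; the ADR only fixes which five fields feed the hash.

```python
import hashlib
import json

# The five inputs named in D2, as dotted paths into the cube spec.
HASH_FIELDS = (
    "geometry",
    "pe_layout",
    "ucie.n_connections",
    "memory_map.hbm_mapping_mode",
    "placement",
)


def _get(spec, dotted):
    """Walk a dotted path through nested dicts; missing keys give None."""
    cur = spec
    for key in dotted.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur


def compute_source_hash(cube_spec):
    payload = {name: _get(cube_spec, name) for name in HASH_FIELDS}
    # sort_keys makes the serialization independent of dict insertion order.
    blob = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def needs_regeneration(existing_mesh, cube_spec):
    """True when there is no cached mesh or its hash no longer matches."""
    if existing_mesh is None:
        return True
    return existing_mesh.get("source_hash") != compute_source_hash(cube_spec)
```

Hashing only the listed fields is what makes a bandwidth-only edit a cache hit, while a change to `ucie.n_connections` forces a rebuild.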
 ### D3. Cube NoC mesh auto-layout algorithm
 `_generate_mesh(cube_spec)`:
 #### D3.1. Choosing rows and columns
 - `pe_positions = _corner_pe_positions(cube_w, cube_h)`: PE center
  coordinates (mm) for each of the 4 corners (NW/NE/SW/SE). Uses the hardcoded
  `(1.5, 1.5)` / `(cube_w-1.5, cube_h-1.5)` pattern; with `pe_per_corner=2`
  each corner holds 2 PE positions.
 - `col_xs = _compute_col_positions(...)`: the union of the PE x coordinates,
  plus relay columns inserted into every gap larger than `max_spacing = 3.0 mm`.
 - `row_ys, rows_per_half = _compute_row_positions(cube_h, n_connections,
  pe_positions)`:
   - `n_conn = max(n_connections, 2)` (hot path minimum).
   - `rows_per_half = ceil(n_conn / 2)`.
   - top half + two HBM rows + bottom half. The HBM sits at
     `(cube_h/2 - 1.5, cube_h/2 + 1.5)`. The gap between the PE rows and the
     HBM rows is `hbm_gap = 1.5 mm`.
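A sketch of the column and row-count rules listed above. The relay placement rule (evenly spaced columns inside an oversized gap) is an assumption; the ADR states only that relay columns are inserted where the gap exceeds `max_spacing`. The function names are hypothetical.

```python
import math

MAX_SPACING_MM = 3.0  # value quoted in D3.1


def compute_col_positions(pe_xs, max_spacing=MAX_SPACING_MM):
    """Union of PE x coordinates, with relay columns in oversized gaps."""
    xs = sorted(set(pe_xs))
    cols = []
    for left, right in zip(xs, xs[1:]):
        cols.append(left)
        gap = right - left
        if gap > max_spacing:
            segments = math.ceil(gap / max_spacing)
            # Evenly spaced relays: an assumed rule, not taken from mesh_gen.py.
            cols.extend(left + gap * k / segments for k in range(1, segments))
    if xs:
        cols.append(xs[-1])
    return cols


def rows_per_half(n_connections):
    """ceil(max(n_connections, 2) / 2), as stated in D3.1."""
    return math.ceil(max(n_connections, 2) / 2)
```

With PE columns at 1.5 mm and 10.5 mm, the 9 mm gap is split into three 3 mm segments, giving relay columns at 4.5 mm and 7.5 mm.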
 #### D3.2. HBM exclusion zone
 `hbm_row_start = rows_per_half`, `hbm_row_end = rows_per_half + 1`.
 `hbm_col_start = n_cols // 2 - 1`, `hbm_col_end = n_cols // 2`.
 Router slots inside this (row, col) rectangle are marked `None` (no router).
 The actual HBM controllers are attached as separate `hbm_ctrl.pe{X}` nodes,
 following the per-PE partition pattern of ADR-0017 D9.
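The exclusion rectangle can be sketched as follows, assuming the start/end bounds are inclusive and the grid has `2 * rows_per_half + 2` rows (top half, two HBM rows, bottom half). The helper names and the `r{R}c{C}` labels (borrowed from the `noc.r{R}c{C}` naming) are illustrative only.

```python
def hbm_zone(rows_per_half, n_cols):
    """Set of (row, col) slots that hold no router."""
    row_start, row_end = rows_per_half, rows_per_half + 1
    col_start, col_end = n_cols // 2 - 1, n_cols // 2
    return {
        (r, c)
        for r in range(row_start, row_end + 1)   # inclusive bounds (assumed)
        for c in range(col_start, col_end + 1)
    }


def router_grid(rows_per_half, n_cols):
    """Grid of router labels with None in the HBM exclusion zone."""
    n_rows = 2 * rows_per_half + 2  # top half + two HBM rows + bottom half
    zone = hbm_zone(rows_per_half, n_cols)
    return [
        [None if (r, c) in zone else f"r{r}c{c}" for c in range(n_cols)]
        for r in range(n_rows)
    ]
```

For `rows_per_half=1` and `n_cols=4` this leaves a 4×4 grid with the central 2×2 block empty, i.e. 12 routers.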
 #### D3.3. PE attachment
 The PEs of each corner map to the following rows:
 - Top half: NW → row 0, NE → row 1 (the index within top_corners).
 - Bottom half: SW → row `hbm_row_end + 1`, SE → row `hbm_row_end + 2`.
 Each PE attaches to the router in the column nearest to its x coordinate
 (`min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))`). Three entries
 are attached, `pe{pe_idx}.dma`, `pe{pe_idx}.cpu`, and `pe{pe_idx}.hbm` (pushed
 onto the per-router attach list).
 #### D3.4. M_CPU / SRAM attachment: nearest router by Euclidean distance
 From the coordinates `placement.m_cpu.pos_mm` (default `[1.5, 5.5]`) and
 `placement.sram.pos_mm` (default `[1.5, 8.5]`), the nearest router by
 Euclidean distance is found and `"m_cpu"` / `"sram"` is added to its attach list.
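A nearest-router lookup in this style might look like the sketch below. The function name, the `(row, col)` return convention, and the `excluded` parameter (for slots such as the HBM zone that hold no router) are assumptions for illustration.

```python
import math


def nearest_router(pos_mm, col_xs, row_ys, excluded=frozenset()):
    """Return the (row, col) of the router closest to pos_mm = (x, y)."""
    candidates = [
        (r, c)
        for r in range(len(row_ys))
        for c in range(len(col_xs))
        if (r, c) not in excluded  # skip slots that hold no router
    ]
    return min(
        candidates,
        key=lambda rc: math.hypot(col_xs[rc[1]] - pos_mm[0],
                                  row_ys[rc[0]] - pos_mm[1]),
    )
```

Working in mm lets the same coordinates drive both this lookup and the diagram layout, without requiring the router grid to exist when the placement is written.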
 #### D3.5. UCIe N/S/E/W distribution
 `ucie_pe_rows = top_pe_rows + bot_pe_rows` (`2 * rows_per_half` in total).
 - UCIe-E: `ucie_e.c{i}` on the router of the rightmost column, one per PE row.
 - UCIe-W: `ucie_w.c{i}` on the router of the leftmost column (mirror of E).
 - UCIe-N/S: the PE columns are split half to the left and half to the right
  and attached to the corresponding columns of the top row / bottom row.
 Each UCIe connection carries a `c{i}` index, so that ucie_n_connections PHYs
 are distributed (ADR-0017 D5+).
 ### D4. Node naming convention: single owner
 builder.py creates nodes under the following naming convention (the
 single-owner principle of ADR-0051 D5):
 - `fabric.switch0`: system-level switch.
 - `sip{S}.{io_id}.{pcie_ep|io_cpu|io_noc|io_ucie.{dir}|conn.{id}}`: IO
  chiplet.
 - `sip{S}.cube{C}.{m_cpu|sram|hbm_ctrl.pe{X}|noc.r{R}c{C}|...}`: inside a cube.
 - `sip{S}.cube{C}.pe{P}.{pe_cpu|pe_dma|pe_fetch_store|pe_gemm|pe_math|pe_mmu|pe_tcm|pe_scheduler|pe_ipcq}`: PE sub-components.
 Changing this naming convention requires updating both builder.py and the
 helpers in router.py (ADR-0051) together. Components do not know the naming
 convention directly; they only call helpers.
 ### D5. Edge `kind` classification
 The `kind` assigned to each edge is the input to the routing policy
 (ADR-0051 D2). The main kind values:
 - `"pe_internal"`: between sub-components inside a PE.
 - `"pe_to_router"`: PE_DMA ↔ cube NoC router.
 - `"router_mesh"`: between cube NoC routers.
 - `"router_to_hbm"`, `"router_to_mcpu"`, `"router_to_sram"`,
  `"sram_to_router"`, etc.: between cube-attached components.
 - `"ucie_internal"`, `"ucie_conn_to_router"`, `"router_to_ucie_conn"`,
  `"ucie_conn_to_noc"`, `"noc_to_ucie_conn"`, `"ucie_mesh"`: UCIe related.
 - `"io_internal"`: inside an IO chiplet.
 - `"io_to_cube"`, `"cube_to_io"`: the IO ↔ cube boundary.
 - `"pcie"`: switch ↔ pcie_ep.
 - `"command"`: control-plane only edges (M_CPU ↔ NOC, etc.; excluded from
  the PE DMA path).
 When adding a new edge kind, decide which category of the 4 adjacency graphs
 in router.py (ADR-0051 D2) it belongs to. Otherwise it is included in every
 graph by default, because the graphs are defined by exclusion lists, and
 unintended routing can result.
 ### D6. View projection: 4 abstraction levels
 Besides the flat form (nodes + edges), `TopologyGraph` holds 4 view
 projections:
 - **system_view** (`_build_system_view`): Tray level. SIP boxes +
  `fabric.switch0`. Shows the PCIE links. A high-level overview for external
  presentations.
 - **sip_view** (`_build_sip_view`): inside one SIP. Cube mesh + IO chiplet
  (pcie_ep + io_cpu + io_noc). UCIe N/S/E/W appears as inter-cube connections.
 - **cube_view** (`_build_cube_view`): inside one cube. Router grid +
  PE/M_CPU/SRAM/HBM_CTRL attachments + the UCIe PHY edge portion. For
  diagnosing intra-cube routing / placement.
 - **pe_view** (`_build_pe_view`): inside one PE. 9 sub-components + internal
  edges (pe_internal kind). For reviewing the detailed dataflow inside a PE.
 Views are emitted selectively via the spec, e.g.
 `visualization.emit_views: [system, sip, cube]` (ADR-0006). The pe view is
 left out of the default output, but the code is kept (for detailed debugging).
 ### D7. visualizer.py: SVG diagram output
 `emit_diagrams(graph, out_dir)` renders every view to SVG. Key functions:
 - `_render_view_svg(view)`: generic view rendering (when there is no router
  grid).
 - `_render_cube_view_svg(view, spec)`: cube view only (drawing the HBM block,
  router grid layout, PE/M_CPU/SRAM/HBM positioning).
 - `_draw_node`, `_draw_edge`: the visual representation of nodes / edges.
 - `_pick_scale`, `_compute_node_sizes`: automatic scaling.
 The visualizer is classified as a **derived artifact** (ADR-0006) and is not
 subject to the production check when its code changes. This is consistent
 with the "Derived Artifacts" item of CLAUDE.md.
 ### D8. Impact scope of spec changes
 | spec field                            | affects                 | mesh regeneration |
 |---------------------------------------|-------------------------|-------------------|
 | `system.sips.count`                   | SIP count, node count   | No                |
 | `sip.cube_mesh.w/h`                   | cube mesh shape         | No                |
 | `cube.geometry.cube_mm.w/h`           | cube size (mm)          | **Yes**           |
 | `cube.pe_layout.corners/pe_per_corner`| PE attachment positions | **Yes**           |
 | `cube.ucie.n_connections`             | UCIe PHY distribution   | **Yes**           |
 | `cube.memory_map.hbm_mapping_mode`    | HBM distribution mode   | **Yes**           |
 | `cube.placement`                      | M_CPU/SRAM position     | **Yes**           |
 | `cube.memory_map.*` (other than above)| HBM capacity / BW       | No                |
 | `*.links.*.bw_gbs`                    | edge bandwidth          | No                |
 | `*.attrs.overhead_ns`                 | component latency       | No                |
 The table matches the `_compute_source_hash` inputs of D2. Any change that
 requires mesh regeneration automatically invalidates the source_hash of
 `cube_mesh.yaml`.
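 ## Alternatives Considered
 ### A1. Regenerating the mesh on every compile, without a separate cache file
 Rejected. Every case that calls with the same spec repeatedly (CLI run,
 probe, test) would pay the mesh generation cost again. The artifact that
 lets a person inspect the mesh would also disappear.
 ### A2. Merging mesh generation into builder.py
 Rejected (for now). It is a self-contained 305-line algorithm, and the mesh
 layout decisions (placement-driven router attachment, HBM exclusion zone)
 differ from the builder's general responsibility of creating nodes/edges.
 Keeping them separate better fits the single-responsibility principle.
 ### A3. Expressing placement coordinates as cube coordinates (col/row)
 Rejected. Coordinates in mm are used consistently by both the visualization
 side (visualizer) and the mesh layout side (nearest-router computation). Cube
 coordinates are undefined until the router grid is decided, so they are
 unsuitable as placement input.
 ### A4. Generating view projections lazily
 Rejected (for now). Generating the 4 views is cheap (usually < 100 ms), and
 eager generation guarantees a single source of truth through `TopologyGraph`.
 ### A5. Visualizer output formats other than SVG (PNG/PDF)
 Rejected. SVG is the most flexible format: vector, text-searchable, and
 directly renderable in a browser. If PNG conversion is needed, post-process
 with a separate tool (rsvg-convert, etc.).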
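 ## Consequences
 - The high-level intent of ADR-0006 is made concrete by D1 through D7, and
  the impact of a topology change can be gauged quickly from the D8 table.
 - The mesh auto-layout algorithm of D3 is fixed at ADR level, so when a new
  PE attachment pattern (e.g. a 6-zone HBM split) is introduced later, it is
  clear which stage is affected.
 - The edge kind list of D5 and the view structure of D6 are explicit, so a
  PR reviewer can see at a glance how far (builder + router + visualizer) a
  new component type reaches.
 - The source_hash invalidation rule of D2 is explicit, so it is clear that a
  cube_mesh.yaml left untouched (e.g. when only a bw value changed) is normal
  behavior.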
@@ -0,0 +1,138 @@
 # ADR-0054: Milestone eval bench (self-contained sweep + figure bench)
 ## Status
 Accepted (2026-05-22).
 Amends ADR-0044 (D1/D2) and ADR-0045 (D5), and supersedes the "logic lives in
 `scripts/` + `tests/`" layout of ADR-0043/0044: the GEMM/allreduce evaluation
 harness now becomes a self-contained **bench** that the user runs to
 regenerate all results + figures.
 ## Context
 ADR-0043 (allreduce evaluation) and ADR-0044 (GEMM evaluation) split each
 harness into a **sweep** (a manual `scripts/` driver or, for allreduce, the
 parametrized tests themselves) + a **figure test** that renders committed
 data. The sweep/render logic therefore lived in `scripts/gemm_sweep.py`,
 `tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.
 The milestone requirement ("refactor the allreduce + GEMM evaluation so that
 the user runs *one bench* to produce all results and plots") cannot be met
 with that layout: a bench is production code and **cannot import `tests/`**
 (ADR-0007 layer direction). The evaluation logic had to move into production
 so that a bench can reach it.
 The chosen home is the bench module itself, not a separate `kernbench.eval`
 package. A bench file may contain arbitrary module-level code, and merging
 the harness into the bench keeps one file per domain and removes one package layer.
 ## Decision
 ### D1. Two milestone benches own the evaluation logic
 - `src/kernbench/benches/milestone_1h_gemm.py`: the GEMM shape×variant sweep
  + three figure renderers (moved from `scripts/gemm_sweep.py` +
  `tests/gemm/_gemm_plot_helpers.py`).
 - `src/kernbench/benches/milestone_1h_ccl.py`: the distributed allreduce
  driver, latency + buffer-kind sweep, topology diagram, FSIM comparison, and
  the direct-launch parity reference (moved from `tests/sccl/_allreduce_helpers.py`).
 Each file is the **single home** of its domain's evaluation logic.
 ### D2. The "eval bench" pattern (extends ADR-0045 D5)
 ADR-0045 D5 pinned a bench to a single configuration (single-SIP, or the
 ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:
 - An **eval bench** may drive *several* configurations and render figures.
  Instead of the outer `run_bench` engine, it builds its own `GraphEngine` /
  `RuntimeContext` at each sweep point.
 - Since no handle is then submitted to the outer ctx, the bench submits a
  **sentinel tensor** (`torch.zeros((1, 1), …)`) at the end to satisfy the
  "submit at least once" contract of `run_bench` (ADR-0045 D4), so that the
  CLI exits with 0.
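 ### D3. Output location
 Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
 (by user request: artifacts next to the bench). The directory holds only
 generated PNG/CSV/JSON files (no `.py` / `__init__.py`), so the eager-import
 audit (ADR-0045 "First action") ignores it: `pkgutil.iter_modules` does not
 yield non-package subdirectories. Like the `docs/diagrams/` artifacts, the
 outputs are **committed** (so the figures are viewable on the remote) and
 are regenerated in place when the bench is re-run.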
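The `pkgutil.iter_modules` behavior that D3 relies on can be checked in isolation. The sketch below builds a throwaway directory; the file names `figure.png` and `real_pkg` are made up for the demonstration, and nothing here touches the real `benches/` package.

```python
import pkgutil
import tempfile
from pathlib import Path


def discover(package_dir):
    """Return the sorted module names pkgutil.iter_modules finds in a directory."""
    return sorted(info.name for info in pkgutil.iter_modules([str(package_dir)]))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "milestone_1h_gemm.py").write_text("")      # a module file
    out_dir = root / "1H_milestone_output" / "gemm"      # artifacts only
    out_dir.mkdir(parents=True)
    (out_dir / "figure.png").write_bytes(b"")
    pkg = root / "real_pkg"                                # has __init__.py
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    found = discover(root)

print(found)
```

The artifact directory is absent from `found` because it contains no `__init__.py`, while the module file and the real package are both reported.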
 ### D4. Heavy GEMM sweep: fresh by default, reuse with `MILESTONE_FAST`
 `milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one
 shape has 2048 tiles). `MILESTONE_FAST=1` reuses the committed
 `docs/diagrams/gemm_sweep.json` and only renders (seconds). This reverses
 ADR-0044 D1/D2's "keep the heavy sweep as a manual / `slow` step": running
 the bench is the regeneration. On the test side, the slow path is exercised
 by a `@pytest.mark.slow` bench test and the fast path runs by default.
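A minimal sketch of the fresh-versus-reuse switch, assuming a hypothetical helper `load_or_run_sweep`. The fallback to a fresh run when the JSON file is missing, and writing the JSON back after a fresh run, are assumptions of this sketch rather than documented bench behavior.

```python
import json
import os
import tempfile
from pathlib import Path


def load_or_run_sweep(json_path, run_sweep, env=None):
    """Reuse committed sweep data when MILESTONE_FAST=1, else run fresh."""
    env = os.environ if env is None else env
    if env.get("MILESTONE_FAST") == "1" and json_path.exists():
        return json.loads(json_path.read_text()), "reused"
    data = run_sweep()                       # the heavy path
    json_path.write_text(json.dumps(data))   # assumption: regenerate in place
    return data, "fresh"


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "gemm_sweep.json"
    fresh = load_or_run_sweep(path, lambda: {"sims": 24}, env={})
    fast = load_or_run_sweep(path, lambda: {"sims": -1}, env={"MILESTONE_FAST": "1"})

print(fresh, fast)
```

The second call never invokes its `run_sweep` callback, which is why it returns the data written by the first call.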
 ### D5. Tests + scripts reuse the bench via thin re-export shims (single home preserved)
 The existing figure tests and the `scripts/gemm_sweep.py` entry point remain
 and now reuse the bench modules:
 - `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers and
  `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
  `kernbench.benches.milestone_1h_gemm`.
 - `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
  writer, sweep constants, renderers, and disk aggregator from
  `kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
  local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
  `_bk_params`) and the fixture-coupled `_run_distributed` wrapper
  (`monkeypatch.chdir` + `_drive_distributed`).
 - `scripts/gemm_sweep.py` → a thin wrapper over the bench's `run_sweep`.
 Tests are allowed to import the bench modules (tests sit above production,
 ADR-0007). Doing so triggers the eager audit of the whole package, which
 already runs on every `kernbench` invocation. matplotlib stays a lazy import
 inside the renderers, so the audit's startup cost is unchanged.
 ### D6. Flat module naming (no `benches/` subfolder)
 A `benches/` subpackage named `1H_milestone…` is impossible: a Python
 package name cannot start with a digit. The benches are therefore the flat
 modules `milestone_1h_gemm.py` / `milestone_1h_ccl.py`, with bench names
 `milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, starting with a letter
 per ADR-0045 D1).
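The digit-prefix restriction can be confirmed with `str.isidentifier`. In the sketch below, `bench_name_for` is a hypothetical helper that derives a kebab-case name from a module name purely for illustration; in the repo the bench name is passed explicitly to `@bench`.

```python
def bench_name_for(module_name):
    """Map a flat module name to a kebab-case bench name (illustrative rule)."""
    if not module_name.isidentifier():
        raise ValueError(f"not an importable module name: {module_name!r}")
    return module_name.replace("_", "-")


print("1H_milestone_output".isidentifier())
print(bench_name_for("milestone_1h_gemm"))
```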
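 ## Consequences
 ### Positive
 - `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of a
  domain's results and figures with one command, which is the milestone
  requirement.
 - The evaluation logic has a single source (the bench) that tests and
  scripts reuse through shims; nothing is duplicated.
 - The figure tests and `scripts/gemm_sweep.py` keep working unchanged.
 ### Negative / limitations
 - Both bench files are large (the CCL one mixes the distributed driver, the
  sweeps, and matplotlib drawing). A "bench" that is mostly an evaluation
  harness is unusual; this ADR justifies it.
 - By explicit request, the generated artifacts live inside the source tree
  (`src/kernbench/benches/`) and are committed (so the figures are viewable
  on the remote); they are regenerated when the bench is re-run.
 - `milestone-1h-ccl` (and the default `milestone-1h-gemm`) takes minutes:
  acceptable for an on-demand milestone artifact, not for routine runs.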
 ## Dependencies
 - **ADR-0007**: layer direction (why tests may import production but a bench
  may not import tests).
 - **ADR-0043 / ADR-0044**: the allreduce / GEMM evaluation harnesses that
  this ADR moves into benches.
 - **ADR-0045**: the bench module contract; D2 here extends its D5
  (single-device rule) with the evaluation-bench pattern and relies on its D4
  (NO_REQUESTS) for the sentinel.
 - **ADR-0024**: the rank = SIP launcher that the allreduce sweep drives.
 ## Open questions
 - Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from
  ADR-0033/0014 instead of copied? Unchanged by this ADR.
 - Should `build_overview_slides.py` consume the milestone output PNGs
  instead of drawing the GEMM bars natively? Still open (ADR-0044 D6 /
  Negative).
@@ -0,0 +1,175 @@
 # ADR Index
 Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **47**.
 Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.
 ## Design Principles
 - [ADR-0013](./ADR-0013-ver-verification-strategy.md) — Verification Strategy and Phase 1 Test Plan
 - [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — Latency Model: Assumptions and Known Simplifications
 ## High-level Architecture
 - [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — Target System Hierarchy and Modeling Scope  _(System hierarchy (Tray / SIP / CUBE / PE))_
 - [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — Runtime API and Simulation Engine Boundaries  _(Runtime API ↔ sim_engine boundaries)_
 - [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NoC and Memory Data Path  _(IOChiplet NOC and memory data path)_
 - [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — Cube NoC and HBM Connectivity  _(Cube NOC and HBM connectivity)_
 ## Detailed Architecture
 One subsection per component file under `src/kernbench/components/builtin/`.
 ### forwarding
 - [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding Component (forwarding_v1)
 ### hbm_ctrl
 - [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM Controller Internal Design
 ### io_cpu
 - [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU Component Model
 ### m_cpu
 - [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU and M_CPU.DMA Component Model
 ### pcie_ep
 - [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model
 ### pe_cpu
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 ### pe_dma
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 - [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
 ### pe_fetch_store
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 ### pe_gemm
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 ### pe_ipcq
 - [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
 ### pe_math
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 ### pe_mmu
 - [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — Dual Role as Component + Utility
 ### pe_scheduler
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 ### pe_tcm
 - [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — Dual-Channel BW Serialization
 ### sram
 - [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC
 ### tiling
 - [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math Pipeline Plan Builders
 ## Implementation Decisions
 ### Address Scheme
 - [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51-bit Physical Address Layout and Decoding Contract
 - [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — Memory Addressing — PA / VA / LA Address Model
 ### Routing & Helper API
 - [ADR-0002](./ADR-0002-lat-routing-distance.md) — Routing Distance, Ordering, and Bypass Rules
 - [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`
 ### Memory Semantics & Local-HBM Bandwidth
 - [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — Memory Semantics and Local HBM Bandwidth Guarantees
 ### Topology Compilation, Diagrams & Builder Algorithms
 - [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — Diagram Views and Distance-Based Layout Rules
 - [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — Topology Compilation, Distance Extraction, and Automatic Diagram Generation
 - [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms
 ### Tensor Deployment and Allocation
 - [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — Tensor Deployment and Allocation (Host Allocator, PA-first)
 ### Kernel Execution and Host-Device Messaging
 - [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — Kernel Execution Messaging and Completion Semantics
 - [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
 ### CLI Surface and Semantics
 - [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — Command-Line Interface and Execution Semantics
 ### Component Port/Wire Fabric Model
 - [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — Component Port/Wire Model and Fabric Routing
 ### Two-Pass Data Execution
 - [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass Data Execution Model (Timing / Data Separation)
 ### 2D Grid Program Identity
 - [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D Grid program_id Semantics
 ### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
 - [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
 - [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — sip/num_sips Fields Removed
 - [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
 - [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
 - [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`
 ### IPCQ Direction Addressing
 - [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching
 ### Intercube All-Reduce
 - [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — Intercube All-Reduce — pe0 Cube-Mesh Reduce + Multi-SIP Exchange
 ### Evaluation Harnesses
 - [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce Evaluation Harness — `tests/sccl/`
 - [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`
 - [ADR-0054](./ADR-0054-eval-milestone-benches.md) — Milestone evaluation benches — self-contained sweep + figure benches
 ### Bench Module Contract
 - [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring
 ### Kernel-side tl.* API (TLContext)
 - [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract
 ### Memory Allocator Algorithms
 - [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
 ### Probe Subcommand
 - [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
 ### Sim-engine Op Log and Memory Store Schemas
 - [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -7,6 +7,11 @@ Accepted
 Documents the `tests/sccl/` evaluation harness; verified against the
 implementation (constants, file set, and sweep dimensions cross-checked).
 **Amended by ADR-0054**: the driver core, sweeps, and renderers moved into
 the `milestone-1h-ccl` bench (single home); `tests/sccl/_allreduce_helpers.py`
 now re-exports from it (keeping the pytest-only param builders +
 `_run_distributed` wrapper local). The figure tests are unchanged.
 ## Context
 ADR-0032 defines the intercube all-reduce *algorithm*; ADR-0023/0024/0027
@@ -9,6 +9,12 @@ implementation (constants, tile sizes, figure set, and the script↔test
 split cross-checked). The D5/D6 caveats are recorded limitations, not
 inaccuracies.
 **Amended by ADR-0054**: the sweep + renderers moved into the
 `milestone-1h-gemm` bench (single home); `scripts/gemm_sweep.py` and
 `tests/gemm/` now re-export from it. D1/D2's "data generation stays a manual
 script / heavy work is opt-in" is superseded by the eval-bench pattern (one
 bench regenerates everything; `MILESTONE_FAST=1` reuses the committed JSON).
 ## Context
 ADR-0014 (PE pipeline) and ADR-0042 (tile-plan generators) define the GEMM
@@ -0,0 +1,295 @@
 # ADR-0045: Bench Module Contract — registration, dispatch, and authoring
 ## Status
 Accepted (2026-05-21).
 Unifies the `src/kernbench/benches/` registration mechanism (@bench), the
 CLI dispatch path (`kernbench run/list`), and the contract a new bench
 module must follow. ADR-0010 (CLI surface) specifies the `kernbench
 list/run` interface, but **how benches are registered and what signature
 they must follow** had no ADR-level coverage.
 **Extended by ADR-0054**: D5's single-config rule gains a third pattern —
 the *eval bench* (e.g. `milestone-1h-*`) drives many configs, builds its
 own per-config engines, and submits a sentinel tensor to satisfy D4.
 ## First action
 When `kernbench.benches` is imported, `__init__.py` immediately calls
 `_eager_import_and_audit(__path__, __name__)`. Its first action is to
 enumerate every sibling module in the package directory via
 `pkgutil.iter_modules(__path__)` and **eagerly import** each one via
 `importlib.import_module(...)` — except modules matching either:
 - name `registry` (the infrastructure module itself), or
 - name starting with `_` (helper modules).
 At import time, each `@bench(name=..., description=...)` decorator inside
 the imported module runs, appending `(name, description, fn)` to
 `_PENDING` and adding `fn.__module__` to `_REGISTERED_MODULES`.
 Once imports finish, `_audit_modules(imported, _REGISTERED_MODULES)`
 runs; if any imported module did not invoke `@bench` at least once, it
 raises `RuntimeError("Bench module(s) missing @bench decorator: ...")`
 immediately. At this point indices are still unassigned — the first call
 to `list_all()` / `resolve(...)` triggers `_finalize()`, which sorts
 `_PENDING` alphabetically by name and assigns 1-based indices.
 In short, **the bench infrastructure's first act is "eagerly import
 every non-helper module in the package and audit that each one
 registered at least one bench"**.
 ## Context
 `src/kernbench/benches/` currently holds 8 bench modules (`ccl_allreduce`,
 `gemm_single_pe`, `gpt3_qkv`, `ipcq_allreduce`, `matmul_composite`,
 `qkv_gemm`, `qkv_gemm_multi_pe`, `va_offset_verify`). Every bench follows
 the same unified flow:
 ```
 kernbench run --topology <T> --bench <N>
   ↓
 cli/main.py::cmd_run
   ↓  resolve_topology(T)  + resolve(N)  + resolve_device(device_arg)
   ↓
 runtime_api/bench_runner.py::run_bench(topology, bench_fn, device, engine_factory)
   ↓  engine_factory(topology, device) → GraphEngine
   ↓  RuntimeContext(engine, target_device, correlation_id, spec)
   ↓
 bench_fn(ctx)        ← invokes the bench's run(torch)
   ↓  ctx.empty/zeros/from_numpy/launch/distributed.* etc. submit work
   ↓
 ctx.wait_all()       ← drains any outstanding handles
   ↓
 BenchResult(completion, correlation_id, trace, traces, engine)
 ```
 ADR-0010 covers only the CLI surface (`run/list/probe/web`); ADR-0007
 covers only the runtime API ↔ sim_engine boundary. The question "what
 shape must a new bench file take?" had to be answered by grepping the
 codebase. As a result:
 - The @bench decorator contract (kebab-case name, non-empty description)
  lived only in the source.
 - The bench function signature (`def run(torch)`) was a de-facto
  convention enforced by the CLI dispatcher calling `spec.run`.
 - New bench authors learned the "helpers must use `_` prefix" rule only
  after seeing the audit's RuntimeError.
 - The single-device convention (CLAUDE.md Part 2 CLI Semantics) and its
  interaction with multi-SIP CCL benches was ambiguous for bench
  authors.
 This ADR consolidates all of it in one place.
 ## Decision
 ### D1. @bench decorator contract
 ```python
 from kernbench.benches.registry import bench
 @bench(name="my-bench", description="Short, complete-sentence description.")
 def run(torch):
    ...
 ```
 - `name`: kebab-case string matching `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`.
  Lowercase letters, digits, and dashes only; underscores forbidden;
  must start with a letter.
 - `description`: non-empty string (stripped length > 0). Displayed
  verbatim by `kernbench list`.
 - The decorator **returns the function unchanged** — direct invocation
  is fine. Its only side effect is appending to `_PENDING`.
 Violations of the first two rules raise `ValueError` at decoration time.
 Duplicate names are caught at `_finalize()` with
 `RuntimeError("duplicate bench name: ...")`.
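The two decoration-time checks can be sketched with the regular expression quoted in D1. The function name `validate` and the error messages are assumptions of this sketch; only the pattern and the "stripped length > 0" rule come from the text above.

```python
import re

_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


def is_valid_name(name):
    """True when the name is kebab-case and starts with a letter."""
    return isinstance(name, str) and _NAME_RE.fullmatch(name) is not None


def validate(name, description):
    """Apply the name and description checks described in D1."""
    if not is_valid_name(name):
        raise ValueError(f"bench name must be kebab-case: {name!r}")
    if not isinstance(description, str) or not description.strip():
        raise ValueError("bench description must be a non-empty string")


for candidate in ["gemm-single-pe", "milestone-1h-gemm", "qkv_gemm", "1h-gemm", "gemm-"]:
    print(candidate, is_valid_name(candidate))
```

Digits are allowed inside and at the start of later segments (`milestone-1h-gemm`), but not as the first character of the whole name.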
 ### D2. Module-file convention
 Every `src/kernbench/benches/<slug>.py` must be one of:
 - **A bench module**: at top-level import, `@bench(...)` runs at least
  once to register at least one bench.
 - **A helper module**: the filename starts with `_` (e.g.,
  `_shared_helpers.py`). The eager-import loop skips it.
 The audit (`_audit_modules`) rejects any non-helper that fails to call
 `@bench`. Intended consequence: dropping a new file into `benches/`
 automatically registers its benches, and helper modules are clearly
 flagged by their filename prefix alone.
 ### D3. The bench function signature is `def run(torch)`
 The decorator does not enforce a function name, but **CLI dispatch calls
 `spec_entry.run`** (the decorated callable). The convention is therefore:
 - Function name: `run`. Other names work, but always use `run` for
  readability and grep-ability.
 - Argument: a single positional `torch`. In practice this is a
  `RuntimeContext` instance exposing PyTorch-style namespaces
  (zeros/empty/launch/distributed/...) — see ADR-0024 D3.
 - Return value: any (`Any`). `run_bench` ignores it and tracks
  completion via `ctx.handles()` / `engine.get_completion()`.
 The `torch` name imitates a PyTorch-compatible idiom; the actual PyTorch
 module is not passed in (aligned with ADR-0024's "rank = SIP" launcher
 convention).
 ### D4. A bench must submit at least once
 If `ctx.handles()` is empty after the bench returns, `run_bench` reports a
 `BenchResult.completion` with `ok=False` and `error_code="NO_REQUESTS"`. A
 meaningful bench must therefore invoke at least one of:
 - Tensor-creation APIs: `torch.zeros(...)`, `torch.empty(...)` — these
  internally submit `MmuMapMsg` and (for zeros) `MemoryWriteMsg`.
 - Kernel-launch API: `torch.launch(name, fn, *args)` — submits per-SIP
  `KernelLaunchMsg`.
 - Exception: empty placeholder benches. For example, the
  `print(...)`-only stub in `ipcq_allreduce.py` submits nothing and
  receives a NO_REQUESTS result. CI is expected to recognize placeholder
  benches and handle them specially.
 ### D5. Single-device convention + multi-SIP exception (ADR-0024/0027)
 CLAUDE.md Part 2 CLI Semantics' **"benchmarks MUST remain
 single-device"** rule is interpreted as follows:
 - **Standard bench (single-SIP use)**: define tensor placement with
  `dp = DPPolicy(...)` and launch with `torch.launch(...)`. The SIP
  index is chosen by `--device` (CLI's responsibility).
 - **CCL bench (multi-SIP use)**: as an exception, use
  `torch.distributed.init_process_group(backend="ahbm")` plus
  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` for the
  rank = SIP pattern (ADR-0024 D3). `--device` is ignored (or treated
  as `all`); each spawned worker calls `torch.ahbm.set_device(rank)` to
  bind to its SIP.
 Multi-device patterns outside these two (e.g., one bench function
 launching across multiple SIPs in the same process) are forbidden by
 this ADR. Even with `--device all`, the bench runs once; multi-SIP use
 inside that single run must follow D5's second pattern.
 ### D6. Name/index resolution (`resolve`)
 `resolve(identifier: str)` returns a BenchSpec via:
 1. If `identifier.isdigit()`: convert to int and find the spec where
   `index ==` that value. If none, `ValueError("No bench with index
   ...")`.
 2. If `identifier in _REGISTRY`: direct lookup.
 3. Otherwise: `ValueError("Unknown bench ...")`.
 Empty or whitespace-only identifiers raise `ValueError("bench
 identifier must be a non-empty string.")`.
 The CLI passes `--bench` directly to `resolve`, so users can use either
 `kernbench run --bench gemm-single-pe` or `kernbench run --bench 2`.
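The lookup order in D6 can be sketched as below. The `registry` dict (name to index) and the tuple return value are simplifications for illustration; the real function returns a BenchSpec, and the example names and indices are made up.

```python
def resolve(identifier, registry):
    """Look up a bench by 1-based index or by name.

    `registry` maps name -> index; a tuple stands in for the real BenchSpec.
    """
    if not isinstance(identifier, str) or not identifier.strip():
        raise ValueError("bench identifier must be a non-empty string.")
    if identifier.isdigit():
        wanted = int(identifier)
        for name, index in registry.items():
            if index == wanted:
                return (index, name)
        raise ValueError(f"No bench with index {wanted}")
    if identifier in registry:
        return (registry[identifier], identifier)
    raise ValueError(f"Unknown bench {identifier!r}")


registry = {"ccl-allreduce": 1, "gemm-single-pe": 2, "gpt3-qkv": 3}
print(resolve("2", registry), resolve("gpt3-qkv", registry))
```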
 ### D7. Indices are not a stable API
 `_finalize()` sorts `_PENDING` alphabetically by name and assigns
 1-based indices. Adding a new bench can shift existing benches'
 indices. Therefore:
 - Human-interactive use: indices are fine.
 - Scripts / CI automation: always use the name.
 This caveat is documented in `registry.py`'s module docstring.
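A short illustration of why indices are unstable, assuming a hypothetical helper `assign_indices` that mirrors the "sort by name, number from 1" rule. The bench names are examples only.

```python
def assign_indices(names):
    """Sort names alphabetically and assign 1-based indices."""
    return {name: index for index, name in enumerate(sorted(names), start=1)}


before = assign_indices(["gemm-single-pe", "qkv-gemm"])
after = assign_indices(["gemm-single-pe", "qkv-gemm", "ccl-allreduce"])
print(before["gemm-single-pe"], after["gemm-single-pe"])
```

Adding one bench whose name sorts first moves every existing bench down by one, so a script pinned to an index would silently run a different bench.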
 ### D8. Surface RuntimeContext exposes to benches
 A bench's `torch` parameter may legitimately use:
 - **Tensor creation**: `torch.empty(shape, dtype=..., dp=DPPolicy(...),
  name=...)`, `torch.zeros(...)`, `torch.from_numpy(arr)`. All submit
  host-side metadata plus device deployment (`MmuMapMsg` +
  `MemoryWriteMsg`).
 - **Kernel launch**: `torch.launch(kernel_name, kernel_fn, *args)` —
  converts `(Tensor, int, float)` positional args to `TensorArg` /
  `ScalarArg`, submits per-SIP `KernelLaunchMsg`, and drains.
 - **Synchronization**: `torch.wait(handle)`, `torch.wait_all()`
  (`run_bench` calls the latter automatically).
 - **Distributed**: `torch.distributed.init_process_group(backend="ahbm")`,
  `torch.distributed.get_world_size()`,
  `torch.distributed.all_reduce(t, op=...)` (ADR-0024/0027).
 - **Multi-process (rank = SIP)**:
  `torch.multiprocessing.spawn(_worker, ..., nprocs=ws)` (ADR-0024 D3 /
  ADR-0027).
 - **Device binding**: `torch.ahbm.set_device(rank)` or
  `torch.accelerator.set_device_index(rank)` (both point to the same
  namespace).
 - **IPCQ install**: `torch.install_ipcq(algorithm=..., ccl_yaml=...)`
  (ADR-0023 D10).
 - **Spec lookup**: `torch.spec` — the dict produced by the topology
  builder (system / cube_mesh / HBM parameters etc.). Use it so the
  bench does not hardcode topology.yaml values.
 Benches must not access RuntimeContext private members (`_handles`,
 `_traces`, `_allocators`, etc.) directly. This aligns with ADR-0007's
 layer-boundary spirit: bench → runtime API → sim_engine flows in one
 direction.
 ### D9. Environment-variable parameterization is allowed
 Benches may parameterize themselves via `os.environ.get(...)`, as
 `matmul_composite.py` does for `MATMUL_M`, `MATMUL_K`, `MATMUL_N`,
 `MATMUL_DTYPE`, `MATMUL_VARIANT`. Rationale:
 - The bench function signature is fixed by D3 to `def run(torch)`, so
  positional/keyword arguments cannot carry parameters.
 - The env-var pattern is a natural hook for operational sweeps (e.g.,
  `MATMUL_VARIANT`).
 - External drivers such as `scripts/gemm_sweep.py` (ADR-0044) consume
  this hook (it sets `MATMUL_M/K/N/VARIANT` at
  `scripts/gemm_sweep.py:115-118`).
 When environment variables alter bench behavior, the module docstring
 must list every variable used (`matmul_composite.py` is the canonical
 example).
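The env-var hook might look like the sketch below. The variable names come from D9; the helper name `read_matmul_params` and every default value are placeholders chosen for illustration, not the values used by `matmul_composite.py`.

```python
import os


def read_matmul_params(env=None):
    """Read the MATMUL_* variables named in D9.

    The default values below are placeholders for illustration only.
    """
    env = os.environ if env is None else env
    return {
        "m": int(env.get("MATMUL_M", "128")),
        "k": int(env.get("MATMUL_K", "128")),
        "n": int(env.get("MATMUL_N", "128")),
        "dtype": env.get("MATMUL_DTYPE", "f16"),
        "variant": env.get("MATMUL_VARIANT", "default"),
    }


params = read_matmul_params({"MATMUL_M": "256", "MATMUL_VARIANT": "tiled"})
print(params)
```

Accepting an explicit `env` mapping keeps the helper testable without mutating the process environment; a sweep driver would instead set the variables before invoking the bench.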
 ## Alternatives Considered
 ### A1. An explicit manifest file (YAML) listing benches
 Rejected. The `@bench` + audit pattern guarantees "drop in file →
 auto-register", concentrating cognitive cost in one place (the file itself).
 A separate manifest is prone to drift, and helper separation is already
 clear via the `_` prefix.
 ### A2. Allowing the bench's entry-point name in the decorator
 (`@bench(name=..., entry="run_xxx")`)
 Rejected. Breaks the simplicity of dispatch (`spec.run` is a single
 callable). The `run` convention is sufficient; variants can register
 multiple `@bench`-decorated functions in the same module.
 ### A3. A separate `@multi_device_bench` decorator for CCL
 Rejected. The two patterns named in D5 (single + ADR-0024 multi-SIP)
 cover all 8 current benches. A separate decorator would force dispatch
 to branch and add complexity; the multi-SIP intent is already obvious
 from the bench's `init_process_group(...)` call.
 ### A4. Make indices a stable API (registration order or explicit
 `index=` argument)
 Rejected. D7's trade-off favors user-friendliness — alphabetically
 sorted 1-based indices read naturally in the `list` output. Scripts can
 use names.
 ## Consequences
 - "How to add a bench" is consolidated in one ADR — new authors only
  need to read D1-D3 and D8 without grepping source.
 - The `_`-prefixed helper-module pattern is legitimized at ADR level,
  so future `benches/_*.py` shared helpers are free to be added.
 - The CLI's single-device convention and CCL's multi-SIP exception are
  shown to be consistent (D5) — they are orthogonal.
 - The rationale for ADR-0044's GEMM eval harness using env-var hooks
  (D9) is now ADR-pinned.
 - Indices are explicitly unstable (D7), so any CI code calling
  `kernbench run --bench 3` is flagged for review after this ADR is
  accepted.
@@ -0,0 +1,327 @@
 # ADR-0046: TLContext — Kernel-side `tl.*` API Contract
 ## Status
 Accepted (2026-05-22).
 Documents the set of `tl.*` primitives exposed by
 `src/kernbench/triton_emu/`'s `TLContext`, their semantics, and the two
 execution-mode contracts (command-list / greenlet runner). ADR-0014/0020
 define the PE pipeline and the 2-pass execution model, but **the `tl.*`
 surface that bench kernel functions call** had no ADR-level coverage.
 ## First action
 When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id,
 num_cubes, scratch_base, scratch_size)` is instantiated, the first action
 is to initialize six categories of state:
 - `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes` —
  values that `tl.program_id` / `tl.num_programs` will return.
 - `self._dispatch_cycles` — cycle count emitted as `PeCpuOverheadCmd(cycles)`
  at the start of every non-metadata `tl.*` API call (see D6).
 - `self._runner` — `KernelRunner` instance (present → greenlet mode;
  absent → command-list mode).
 - `self._commands: list[PeCommand] = []` — command-list accumulator
  (command-list mode only).
 - `self._handle_counter = 0`, `self._completion_counter = 0` — counters
  for generating TensorHandle / CompletionHandle ids.
 - `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0` —
  PE-local scratch region (used for math/dot/composite output handle
  addresses).
 In short, **TLContext's first act is "record where (sip/cube/pe) and at
 what scale (num_programs/num_cubes) this kernel instance runs, and pick
 its dispatch mode (runner present or not)"**. No SimPy event is created
 and no command is emitted at this moment.
 The runtime first action happens when the kernel function first calls a
 `tl.<api>()`. The standard entry for every command-emitting `tl.*` API is:
 1. Call `self._emit_dispatch_overhead()`: if `dispatch_cycles > 0`,
   immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
 2. Per-API processing (TensorHandle creation, command construction).
 3. `self._emit(cmd)`: in runner mode this hands the cmd to SimPy via
   `runner.switch_to_simpy(cmd)` (a greenlet switch); in command-list mode
   it appends the cmd to `self._commands`.
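The three-step entry path can be modeled with a toy class. `MiniTL` is hypothetical and much smaller than the real `TLContext`: commands are plain tuples, and the runner is modeled as an ordinary callable rather than a greenlet switch.

```python
class MiniTL:
    """Toy model of the three-step entry path; not the real TLContext."""

    def __init__(self, dispatch_cycles, runner=None):
        self._dispatch_cycles = dispatch_cycles
        self._runner = runner
        self._commands = []

    def _emit(self, cmd):
        if self._runner is not None:
            return self._runner(cmd)      # runner mode: hand the cmd over
        self._commands.append(cmd)        # command-list mode: accumulate
        return None

    def _emit_dispatch_overhead(self):
        if self._dispatch_cycles > 0:
            self._emit(("PeCpuOverheadCmd", self._dispatch_cycles))

    def cycles(self, n):
        self._emit_dispatch_overhead()    # step 1: dispatch overhead
        cmd = ("PeCpuOverheadCmd", n)     # step 2: build the command
        return self._emit(cmd)            # step 3: emit it


listed = MiniTL(dispatch_cycles=4)
listed.cycles(10)

seen = []
relayed = MiniTL(dispatch_cycles=0, runner=seen.append)
relayed.cycles(7)
print(listed._commands, seen)
```

With `dispatch_cycles=0` no overhead command is produced, so the relayed context hands over exactly one command.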
 ## Context
 The `tl.*` surface consists of `TLContext`'s methods, and the `tl`
 parameter received by a kernel function is one of these objects. The
 contract the user (bench author) sees:
 - Which primitives exist.
 - What data flow each primitive triggers (DMA / compute / IPCQ /
  metadata-only).
 - How a TensorHandle's `space` and `addr` are decided.
 - The difference between command-list and greenlet modes.
 ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER,
 but how `tl.*` emits them is a code-only convention. ADR-0020 (2-pass
 data execution) mentions greenlet mode in D3 but does not pin down the
 signature difference (return-value handling) between the runner /
 non-runner paths. This ADR fills the gap.
 ## Decision
 ### D1. The `tl` parameter is a `TLContext` instance
 A bench kernel function has the signature:
 ```python
 def _kernel(arg1, arg2, ..., tl, **kwargs):
    ...
 ```
 `tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name
 imitates real Triton's `triton.language` module; the actual Triton
 module is **not** passed in.
 The kernel is plain Python — no `yield` or `async`. `tl.*` calls produce
 SimPy events, but to the caller they appear synchronous because in
 greenlet mode the KernelRunner relays between SimPy and the kernel
 (ADR-0020 D3).
 ### D2. Two execution modes — command-list / greenlet runner
 - **Command-list mode (`runner is None`)**: `tl.*` calls append PeCommand
  to `self._commands`. DMA / GEMM / Math consume no SimPy time and return
  metadata-only TensorHandles (`data=None`). PE_SCHEDULER / sim_engine
  later replays the command sequence in time.
 - **Greenlet runner mode (`runner is not None`)**: `tl.*` calls
  `self._emit(cmd)` → `runner.switch_to_simpy(cmd)`, handing control to
  the parent greenlet (SimPy). The parent distributes the cmd to
  components, consumes SimPy time, and (for DMA reads) returns real numpy
  data. The kernel receives the result and continues to the next line
  (the data-aware execution model from ADR-0020 D3).
 The choice of mode is decided by whether a KernelRunner is injected into
 the TLContext. The `tl.*` methods themselves are mode-blind — they go
 through `_emit()` uniformly.
 ### D3. Primitive categories
 #### D3.1. Reference (no DMA, metadata only)
 - `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: create a handle
  referencing HBM data without issuing DMA. Used when the scheduler
  streams the data per-tile (e.g., the b operand of a composite GEMM).
 #### D3.2. Data movement (blocking, DMA engine)
 - `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → handle.
  Emits `DmaReadCmd`. In greenlet mode the returned handle's `.data`
  carries real numpy data; in command-list mode it is a placeholder.
  The handle has `space="hbm"`, `pinned=True`.
 - `tl.store(ptr, handle) -> None`: TCM → HBM. Emits `DmaWriteCmd`. In
  greenlet mode, when `handle.data` is present, `_store.write("hbm",
  ptr, data)` runs first (visibility = issue time, ADR-0020 D3).
 #### D3.3. GEMM / compute (blocking)
 - `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must live in
  TCM; shapes `(M,K) × (K,N) → (M,N)`. Emits `GemmCmd`; the output
  handle is allocated from PE-local scratch via
  `_make_compute_out(shape, dtype)`.
 - `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
  acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking
  tiled pipeline. Emits `CompositeCmd`. `epilogue` is a list of dicts,
  each with `"op"` plus op-specific fields and an optional `"scope"`
  (k_tile / output_tile). Unknown ops or missing fields raise
  ValueError immediately. The returned CompletionHandle synchronizes
  via `tl.wait(h)`.
 #### D3.4. Math: unary (blocking)
 - `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
  `tl.cos(x)`, `tl.sin(x)` — each emits `MathCmd(op=<name>,
  inputs=(x,), out=)`. `out` is scratch-allocated with the same
  shape/dtype as `x`.
 #### D3.5. Math: binary (blocking)
 - `tl.maximum(a, b)`, `tl.minimum(a, b)` — `_binary_math`.
 - `tl.fma(a, b, c)` — `a*b + c`. Three inputs.
 - `tl.clamp(x, min, max)` — `MathCmd(op="clamp", inputs=(x, min, max))`.
 - `tl.where(cond, a, b)` — `MathCmd(op="where", inputs=(cond, a, b))`.
 - `tl.softmax(x, axis=-1)` — a single `MathCmd(op="softmax")`, so timing is
  accounted as one dispatch. The Phase 2 DataExecutor expands it to the
  canonical (x-max → exp → sum → div) sequence.
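For reference, the canonical softmax sequence named above (x-max, exp, sum, div) looks like this in plain Python. It is a scalar sketch over one row, not the DataExecutor's implementation.

```python
import math


def softmax_expanded(row):
    """Numerically stable softmax as x-max -> exp -> sum -> div."""
    peak = max(row)                                  # max
    shifted = [value - peak for value in row]        # x - max
    exps = [math.exp(value) for value in shifted]    # exp
    total = sum(exps)                                # sum
    return [value / total for value in exps]         # div


probs = softmax_expanded([1.0, 2.0, 3.0])
print(probs)
```

Subtracting the maximum first keeps `math.exp` from overflowing on large inputs without changing the result.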
 #### D3.6. Reduction (blocking)
 - `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)` — return an
  output handle with the axis size collapsed to 1. Emit
  `MathCmd(op=<name>, inputs=(x,), out=, axis=axis)`.
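The output shape of a reduction can be sketched as follows. `reduced_shape` is a hypothetical helper, and accepting negative axes is an assumption of this sketch; the text above only states that the reduced axis collapses to size 1.

```python
def reduced_shape(shape, axis):
    """Return `shape` with the reduced axis collapsed to size 1."""
    rank = len(shape)
    if not -rank <= axis < rank:
        raise ValueError(f"axis {axis} out of range for rank {rank}")
    axis %= rank                      # normalize negative axes
    return tuple(1 if i == axis else dim for i, dim in enumerate(shape))


print(reduced_shape((4, 128), axis=1), reduced_shape((4, 128), axis=-2))
```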
 #### D3.7. Index / scalar (PE_CPU, no engine)
 - `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE
  index), `axis==1` → cube_id (ADR-0022).
 - `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per
  cube), `axis==1` → num_cubes.
 - `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range
  in TCM. No command emitted.
 - `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape,
  value, dtype="f16") -> TensorHandle`: TCM placeholder. No command
  emitted.
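The two-axis identity can be illustrated with a small stand-in. `GridIdentity` is a hypothetical class that only mimics the return values of `tl.program_id` and `tl.num_programs`, and `global_pe_index` is an invented helper showing one way a kernel might combine the two axes.

```python
class GridIdentity:
    """Toy holder for the values tl.program_id / tl.num_programs return."""

    def __init__(self, pe_id, num_programs, cube_id, num_cubes):
        self._ids = {0: pe_id, 1: cube_id}
        self._counts = {0: num_programs, 1: num_cubes}

    def program_id(self, axis=0):
        return self._ids[axis]

    def num_programs(self, axis=0):
        return self._counts[axis]


def global_pe_index(tl):
    """Flatten (cube_id, pe_id) into one index; illustrative only."""
    return tl.program_id(axis=1) * tl.num_programs(axis=0) + tl.program_id(axis=0)


tl = GridIdentity(pe_id=3, num_programs=8, cube_id=2, num_cubes=4)
print(global_pe_index(tl))
```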
 #### D3.8. Scalar helpers (no command, no engine)
 - `TLContext.cdiv(a, b) -> int` (static): ceiling division
  `-(-a // b)`. Mirrors real Triton's `tl.cdiv`.
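The ceiling-division identity is easy to check on its own. The free function below restates the `-(-a // b)` expression from the text; the tile sizes are example numbers.

```python
def cdiv(a, b):
    """Ceiling division using floor division on the negated numerator."""
    return -(-a // b)


tiles = cdiv(1000, 128)
print(tiles)
```

A typical use is counting how many fixed-size tiles cover a dimension, where a partial last tile still counts as one.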
 #### D3.9. Metadata-only (no compute, no DMA)
 - `tl.trans(x) -> TensorHandle`: a new handle with the last two dims
  swapped. Shares `addr` and `data`; no command emitted.
 #### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
 - `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
  dtype="f16", space="tcm") -> None`: blocking send. Accepts either
  handle form or raw-address form. Emits `IpcqSendCmd`. The handle's
  `.data` snapshot rides along on the command — avoiding the race
  where a later inbound IPCQ overwrites the slot before the outbound
  PE_DMA reads it.
 - `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
  dst_space=None) -> TensorHandle`: blocking recv. Providing both
  `dst_addr` and `dst_space` enters "copy_to_dst" mode; otherwise
  "return_slot" mode. In greenlet mode the handle's `.data` carries
  the real data.
 - `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
  **DIAGNOSTIC ONLY**. Has the same blocking-arrival semantics as
  `tl.recv` but skips the slot-read latency charge (slot-IO + PE↔bank
  fabric drain). Used in the pe2pe overview plot for an apples-to-apples
  comparison against `tl.store`. Production kernels MUST NOT use it —
  the diagnostic flag is isolated in its own command branch
  (`consume=False`) so it cannot be accidentally enabled.
 - `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
  recv. Returns a `RecvFuture`; resolved later by `tl.wait(future)`.
 #### D3.11. Composite + control
 - `tl.composite(...)`: see D3.3.
 - `tl.wait(handle=None)`: wait on a `CompletionHandle` (composite), a
  `RecvFuture` (async recv), or `None` (all pending composites).
 - `tl.cycles(n)`: declare a scalar PE_CPU overhead. Emits
  `PeCpuOverheadCmd(cycles=n)`.
 ### D4. TensorHandle arithmetic operators — thread-local TLContext
 At module load, `tl_context.py::_enable_tensor_ops()` runs and patches
 `TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__`. Each
 operator calls `_binary_math` on the active TLContext stored in a
 module-level thread-local `_ctx`.
 So inside a kernel, `c = a + b` is equivalent to emitting
 `MathCmd(op="add", inputs=(a, b), out=<new handle>)` and returning that
 new TensorHandle.
 Active-TLContext management:
 - `TLContext._set_active(ctx)`: set the active ctx for the current
  thread/greenlet.
 - `TLContext._get_active()`: read it (RuntimeError if unset).
 - `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets active
  on entry, runs the kernel, restores `None` on exit.
 `KernelRunner` re-asserts `_set_active(tl)` inside its `_switch_kernel`
 just before resuming the kernel, so a sibling PE runner that overwrote
 the thread-local context is correctly recovered.
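The pattern in D4 can be sketched in a few lines. This is a simplified stand-in under stated assumptions: `Handle`, `Ctx`, `MathCmd`, `set_active`, and `get_active` are hypothetical names that mirror the roles of `TensorHandle`, `TLContext`, `_set_active`, and `_get_active`, and the real classes carry far more state.

```python
import threading
from dataclasses import dataclass, field

_ctx = threading.local()  # module-level thread-local holding the active context


@dataclass
class MathCmd:
    op: str
    inputs: tuple
    out: "Handle"


@dataclass(eq=False)
class Handle:
    name: str

    def __add__(self, other):
        # The operator forwards to whichever context is active right now.
        return get_active().binary_math("add", self, other)


@dataclass
class Ctx:
    commands: list = field(default_factory=list)

    def binary_math(self, op, a, b):
        out = Handle(f"tmp{len(self.commands)}")
        self.commands.append(MathCmd(op, (a, b), out))
        return out


def set_active(ctx):
    _ctx.active = ctx


def get_active():
    ctx = getattr(_ctx, "active", None)
    if ctx is None:
        raise RuntimeError("no active context")
    return ctx


def run_kernel(kernel_fn, ctx, *args):
    """Set the active context on entry and restore None on exit."""
    set_active(ctx)
    try:
        return kernel_fn(*args)
    finally:
        set_active(None)


ctx = Ctx()
c = run_kernel(lambda a, b: a + b, ctx, Handle("a"), Handle("b"))
```

After the call, `ctx.commands` holds one `MathCmd` with `op="add"`, and `c` is the handle recorded as its `out`.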
 ### D5. Scratch allocator — compute output handles
 Ops that produce a result — `tl.dot`, `tl.exp`, `tl.add` (via
 TensorHandle `__add__`), etc. — call `_make_compute_out(shape, dtype)`
 to obtain a 16-byte-aligned scratch address. The address is published
 with `space="tcm"`, so the handle can later be the source of a
 `tl.send` / `tl.store`.
 When `_scratch_base == 0` (e.g., command-list mode), the address is 0
 and the handle cannot be a send/store source (in that case, only
 `tl.load`-returned handles are valid sources).
 When the cursor exceeds `_scratch_size` (default 1 MiB), a
 RuntimeError is raised. The cursor must reset between kernel
 invocations (current code naturally satisfies this: KernelRunner
 creates a fresh TLContext each time).
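A rough sketch of the bump-style scratch cursor described in D5 follows. The class name `ScratchCursor` and its fields are hypothetical, and the sketch assumes the scratch base itself is 16-byte aligned; the real `_make_compute_out` takes `(shape, dtype)` rather than a byte count.

```python
class ScratchCursor:
    def __init__(self, base: int, size: int = 1 << 20, align: int = 16):
        self.base = base      # 0 stands for "no scratch region" (command-list mode)
        self.size = size      # default 1 MiB
        self.align = align
        self.cursor = 0       # byte offset of the next free scratch byte

    def alloc(self, nbytes: int) -> int:
        if self.base == 0:
            return 0  # address 0: the handle cannot be a send/store source
        # Round the cursor up to the next aligned offset.
        start = -(-self.cursor // self.align) * self.align
        if start + nbytes > self.size:
            raise RuntimeError(
                f"scratch overflow: need {nbytes} B at offset {start}, "
                f"size is {self.size} B"
            )
        self.cursor = start + nbytes
        return self.base + start


scratch = ScratchCursor(base=0x1000)
first = scratch.alloc(10)   # occupies offsets 0..9
second = scratch.alloc(4)   # starts at the next 16-byte boundary
```

Creating a fresh cursor per kernel invocation is what makes the reset between invocations automatic in this sketch.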
 ### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
 Every non-metadata `tl.*` call starts with `_emit_dispatch_overhead()`,
 which — when `dispatch_cycles > 0` — emits
 `PeCpuOverheadCmd(dispatch_cycles)`. This models the cycles PE_CPU
 spends dispatching the command.
 Defaults:
 - `TLContext.__init__`'s `dispatch_cycles` parameter default: `1` cycle.
 - TLContext built by `KernelRunner`: `0` cycles (greenlet mode handles
  cycle accounting differently — aligned with ADR-0020 D3 intent).
 ### D7. Kernel registry (`triton_emu/registry.py`)
 A separate `_kernels: dict[str, Callable]` holds the name → function
 mapping:
 - `register_kernel(name, fn)`: ValueError on duplicate.
 - `get_kernel(name)`: KeyError if missing.
 - `clear_registry()`: test-only.
 `RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites
 `_kernels[kernel_name] = kernel_fn` on every call (last-call-wins,
 idempotent) — consistent with ADR-0045 D8's `launch` behavior.
 PE_CPU looks up `KernelRef.name` in the registry and runs the function
 through KernelRunner.
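The registry semantics above are small enough to sketch directly. This is an illustration, not the contents of `triton_emu/registry.py`; `launch_register` is a hypothetical helper standing in for the overwrite that `RuntimeContext.launch` performs.

```python
from typing import Callable

_kernels: dict[str, Callable] = {}


def register_kernel(name: str, fn: Callable) -> None:
    if name in _kernels:
        raise ValueError(f"kernel {name!r} is already registered")
    _kernels[name] = fn


def get_kernel(name: str) -> Callable:
    return _kernels[name]  # raises KeyError if the name is missing


def clear_registry() -> None:
    _kernels.clear()  # test-only reset


def launch_register(name: str, fn: Callable) -> None:
    """Last-call-wins overwrite, with no duplicate check."""
    _kernels[name] = fn


def kernel_v1():
    return 1


def kernel_v2():
    return 2


register_kernel("demo", kernel_v1)
launch_register("demo", kernel_v2)  # overwrites without raising
```

The split between a strict `register_kernel` and a permissive launch path means repeated launches of the same bench never trip the duplicate check.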
 ## Alternatives Considered
 ### A1. Fold `tl.*` into ADR-0014 / ADR-0020
 Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption
 of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing /
 Phase 2 data). The `tl.*` surface is what the kernel author touches; a
 dedicated ADR improves findability and onboarding.
 ### A2. Deprecate command-list mode
 Rejected (currently). Simple unit tests and kernel verification benefit
 from the lighter command-list path — it exposes a PeCommand sequence
 inspector without requiring greenlet machinery. When greenlet-mode
 semantics (real data, Phase 2) are needed, D2 explicitly selects them.
 ### A3. Remove TensorHandle arithmetic operators
 Rejected. They mimic real Triton kernel ergonomics (e.g., `c = a + b`),
 and the thread-local active-ctx pattern works cleanly. The explicit
 function-form (`tl.add(a, b)`) is also exposed in D3.5, so the
 operators are syntactic sugar.
 ### A4. Expand softmax into the explicit sequence (max → exp → sum → div)
 Partially adopted. `tl.softmax` is a single `MathCmd(op="softmax")` for
 timing accounting (D3.5), but Phase 2 DataExecutor expands it to the
 canonical sequence for real-data computation. Timing model atomic,
 data model expanded — the two split intentionally.
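For reference, the canonical expansion named in A4 looks roughly like this in plain Python. The function name is hypothetical, and the sketch assumes the max is subtracted before `exp` for numerical stability, which is the usual reading of "max → exp → sum → div"; the DataExecutor's actual implementation may differ.

```python
import math


def softmax_expanded(row):
    m = max(row)                            # max
    exps = [math.exp(x - m) for x in row]   # exp (shifted by the max)
    total = sum(exps)                       # sum
    return [e / total for e in exps]        # div


probs = softmax_expanded([1.0, 2.0, 3.0])
```

The timing model would still count this as one `MathCmd(op="softmax")`; only the data path walks the four steps.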
 ## Consequences
 - Every `tl.*` primitive a bench author meets is classified and defined
  in a single ADR. Paired with ADR-0045 D8's host-side surface
  (`torch.empty` etc.), the inside-kernel and outside-kernel authoring
  guides are now complete.
 - The command-list / greenlet difference is pinned in D2, so any new
  `tl.*` primitive that follows the `_emit()` pattern auto-supports
  both modes.
 - The thread-local active-ctx pattern (D4) is justified at ADR level,
  clarifying who owns the reset responsibility when multiple PE
  runners share a thread (KernelRunner.run's contract restores active
  inside `_switch_kernel`).
 - `tl.recv_no_consume`'s diagnostic isolation (D3.10) is hardened in
  ADR form — accidental production use is blocked by a separate
  command branch.
 - The registry (D7) gets its own D-section, formalizing the
  name-collision and dynamic-re-registration semantics.
@@ -0,0 +1,259 @@
 # ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
 ## Status
 Accepted (2026-05-22).
 Pins down what `runtime_api/distributed.py`'s `AhbmCCLBackend` +
 `DistributedContext` actually install — i.e., the entry point
 `torch.distributed.init_process_group(backend="ahbm")` — and how
 `all_reduce`/`barrier`/`get_rank` etc. are implemented. ADR-0023 D11
 mentions the "torch.distributed compatibility" intent, but **the backend
 itself** had no ADR-level coverage.
 ## First action
 `RuntimeContext.__post_init__` automatically constructs a
 `DistributedContext()` and attaches it to `self.distributed`. The first
 action at that moment:
 1. `self._backend: AhbmCCLBackend | None = None` — uninitialized.
 2. `self._rank_by_greenlet: dict = {}` — greenlet-local rank registry
   (ADR-0024 D2).
 3. The caller (RuntimeContext) sets `dc._ctx_ref = self` so subsequent
   `init_process_group` can reach `ctx.engine` / `ctx.spec` / `ctx.launch`.
 In short, **DistributedContext's first act is "attach to RuntimeContext
 with a back-reference and leave the backend slot empty"**. Actual
 backend installation (IPCQ install, world_size derivation, algorithm
 module import) happens only when user code calls
 `torch.distributed.init_process_group(backend="ahbm")`.
 At that moment, `init_process_group`'s first action is:
 1. If `backend != "ahbm"`, raise `ValueError("Unsupported backend ...")`
   immediately.
 2. If `getattr(self, "_ctx_ref", None)` is None,
   `RuntimeError("DistributedContext not bound to a RuntimeContext")`.
 3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)` — inside this
   constructor, ccl.yaml is loaded, the algorithm module is imported,
   world_size is derived, SFR is configured, and IPCQ is installed.
 4. `self._backend._dist_ctx = self` — the backend gets a back-reference
   so it can read `_rank_by_greenlet`.
 ## Context
 The `AhbmCCLBackend` exists so that PyTorch DDP collective calls
 (`init_process_group`, `all_reduce`, etc.) work unchanged and bench code
 reads identically to a real DDP training script (in line with
 ADR-0024 + ADR-0027's launcher model).
 The backend's responsibilities:
 - At `init_process_group` time, install the **IPCQ neighbor table once**
  (analogous to NCCL communicator creation).
 - For each `all_reduce(tensor, op="sum")`, dispatch the configured
  algorithm's kernel function via `ctx.launch(...)`.
 - Answer `get_world_size` / `get_rank` consistently from the
  greenlet-local rank registry plus ccl.yaml/topology.
 ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) touch
 parts of this, but **the backend's own responsibility scope and decision
 order** are not pinned anywhere. This ADR fills that gap.
 ## Decision
 ### D1. The backend is created only at `init_process_group(backend="ahbm")` time
 `DistributedContext` starts with `_backend = None`. The backend object
 does not exist until the user calls
 `dist.init_process_group(backend="ahbm")`. Any other API
 (`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) called
 while `_backend` is None raises
 `RuntimeError("Default process group has not been initialized...")` via
 the `_ensure_initialized` helper.
 `backend != "ahbm"` raises `ValueError` immediately. Other backend names
 (`nccl`, `gloo`, etc.) are not recognized.
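The lazy-creation guard in D1 can be sketched as follows. `DistributedContextSketch` is a hypothetical stand-in, and the dict used as the backend is a placeholder for the real `AhbmCCLBackend` object; only the error types and the "backend is `None` until init" rule come from the text above.

```python
class DistributedContextSketch:
    def __init__(self):
        self._backend = None  # no backend until init_process_group runs

    def _ensure_initialized(self):
        if self._backend is None:
            raise RuntimeError(
                "Default process group has not been initialized"
            )
        return self._backend

    def init_process_group(self, backend="ahbm", world_size=None, rank=None):
        if backend != "ahbm":
            raise ValueError(f"Unsupported backend {backend!r}")
        # Placeholder: the real code constructs AhbmCCLBackend here.
        self._backend = {"name": "ahbm", "world_size": 2}

    def get_world_size(self):
        return self._ensure_initialized()["world_size"]

    def get_backend(self):
        return self._ensure_initialized()["name"]
```

Routing every accessor through one helper keeps the "not initialized" error message identical across the API surface.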
 ### D2. world_size resolution priority — algorithm > defaults > topology
 `AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):
 1. If `ccl.yaml`'s algorithm entry has `world_size`, use it.
 2. Else if `defaults.world_size` is set, use it.
 3. Else fall back to `spec.system.sips.count` (the topology's SIP count).
 The default interpretation is **rank = SIP** (ADR-0024). Cube/PE-level
 parallelism is expressed inside each rank via DPPolicy and does not
 affect world_size. An explicit `ccl.yaml` override is preserved for the
 legacy "rank = flat PE index" test path.
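 User arguments to `init_process_group(world_size=..., rank=...)` are
 **accepted but ignored**: the effective values come from the resolution
 order above and the greenlet rank registry (D4), much as a real PyTorch
 script using `env://` initialization takes them from the `RANK` /
 `WORLD_SIZE` env vars.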
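The three-level priority can be written as a short fallthrough. This is a sketch under assumptions: `algo_cfg`, `defaults`, and `sip_count` are hypothetical parameter names for the algorithm entry, the `defaults` section of `ccl.yaml`, and `spec.system.sips.count`.

```python
def resolve_world_size(algo_cfg: dict, defaults: dict, sip_count: int) -> int:
    # 1. The algorithm entry wins if it sets world_size.
    if algo_cfg.get("world_size") is not None:
        return algo_cfg["world_size"]
    # 2. Otherwise use defaults.world_size if present.
    if defaults.get("world_size") is not None:
        return defaults["world_size"]
    # 3. Otherwise fall back to the topology's SIP count (rank = SIP).
    return sip_count


topology_only = resolve_world_size({}, {}, sip_count=4)
with_defaults = resolve_world_size({}, {"world_size": 8}, sip_count=4)
with_override = resolve_world_size({"world_size": 16}, {"world_size": 8}, sip_count=4)
```

Note that the function takes no user-supplied `world_size`, which reflects the "accepted but ignored" rule for `init_process_group` arguments.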
 ### D3. `init_process_group` performs five installation steps
 Inside `AhbmCCLBackend.__init__`, in order:
 1. **Load ccl.yaml**: `kernbench.ccl.install.load_ccl_config()` →
   `resolve_algorithm_config(_cfg_all)` produces the merged config for
   `defaults.algorithm` (or the user-specified algorithm).
 2. **Import algorithm module**:
   `importlib.import_module(self._merged["module"])`. The module must
   expose a `kernel` function, a `kernel_args(world_size, n_elem,
   cube_w, cube_h)` helper, and optionally a `TOPO_NAME_TO_KIND` map.
 3. **Resolve world_size** (D2).
 4. **Collect topology metadata** from `spec`: `n_sips`, `sip_topo`
   (`ring_1d` default), `cube_w`/`cube_h`, `sips.w`/`sips.h`. When the
   SIP topology is not `ring_1d`, derive `_sip_topo_w/h` from explicit
   `w`/`h` or via square-root (require `w*h == n_sips`). Mismatch raises
   `ValueError`.
 5. **Install SFR + IPCQ**:
   `kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine,
   spec, self._merged)`. This pushes IPCQ neighbor tables to every
   SIP/cube's pe0 (one-time setup analogous to NCCL communicator
   creation).
 If the order changes (e.g., SFR runs before the algorithm module
 loads), partial initialization can result. So D3 is treated as an
 atomic five-step block: on failure the backend remains uninstalled.
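The width/height derivation in the topology-metadata step (non-`ring_1d` case) might look like the sketch below. The function name and the handling of a partially specified `w`/`h` are assumptions; the text only states that explicit values are preferred, that a square root is used otherwise, and that `w*h == n_sips` is required.

```python
import math


def derive_topo_dims(n_sips: int, w=None, h=None):
    if w is None or h is None:
        # No explicit dims: assume a square layout.
        side = math.isqrt(n_sips)
        w = h = side
    if w * h != n_sips:
        raise ValueError(
            f"SIP topology {w}x{h} does not match n_sips={n_sips}"
        )
    return w, h


square = derive_topo_dims(16)
explicit = derive_topo_dims(8, w=4, h=2)
```

With `n_sips=8` and no explicit dimensions, the integer square root gives 2x2, so the check fails and a `ValueError` is raised instead of silently truncating the topology.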
 ### D4. Greenlet-local rank binding (ADR-0024 D2)
 `DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps spawned
 worker greenlets to their ranks. When the bench launcher (e.g.,
 `torch.multiprocessing.spawn`) spawns a worker, it registers via
 `dc._bind_rank(g, rank)`.
 `get_rank()` looks up `getcurrent()`'s greenlet. Unregistered greenlets
 fall back to 0 — preserves single-driver / test compatibility.
 The backend reads the current greenlet's rank from
 `_dist_ctx._rank_by_greenlet` during `all_reduce` (D5).
 ### D5. `all_reduce(tensor, op="sum")` behavior
 Validation:
 - `op != "sum"` → `NotImplementedError`. Current kernels only
  implement add reduction.
 - `tensor._handle is None` → `RuntimeError("not deployed")`.
 - `tensor._handle.shards` empty → `RuntimeError("no shards")`.
 Preparation:
 - `n_elem = shards[0].nbytes // tensor.itemsize` — element count of a
  single shard.
 - `kernel_fn = self._algo_module.kernel` — the algorithm module's entry
  function (imported in D3).
 - Decide effective cube dims: if the first SIP has just 1 cube, use
  `(1, 1)`; otherwise use the topology's `cube_w`/`cube_h`. This
  naturally absorbs TP runs that use only a subset of cubes.
 - `kernel_args = self._algo_module.kernel_args(world_size, n_elem,
  cube_w, cube_h)` — the algorithm decides which arguments to pass to
  its kernel.
 Dispatch:
 - Resolve the current greenlet's rank via
  `_rank_by_greenlet.get(g, 0)`.
 - Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w,
  sip_topo_h)`.
 - `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor,
  *kernel_args, *extra_args, _defer_wait=True)` — `_defer_wait=True`
  delegates collective drain to the main scheduler (ADR-0027 D0.4).
 Drain:
 - If the parent greenlet is alive (multi-greenlet mode), enqueue
  `_pending_collective_handles` and switch to parent. The main
  scheduler drains after all ranks have launched.
 - If single-driver mode, drain inline:
  `for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
 ### D6. `barrier()` is a no-op (single-driver model)
 kernbench runs all ranks as greenlets inside a single Python process,
 so no cross-process synchronization is needed. `barrier()` is callable
 but does no synchronization. Kept for real-PyTorch API compatibility so
 callers don't get `NotImplementedError`.
 If multi-process kernbench (SimPy event loop per process) is introduced
 in the future, D6 needs a superseding ADR.
 ### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`
 - `get_rank()` (D4): the current greenlet's bound rank; unregistered → 0.
 - `get_world_size()` (D2): the world_size resolved by the backend in D3.
 - `get_backend()`: always the literal string `"ahbm"`. Calling before
  backend exists triggers `_ensure_initialized`'s RuntimeError.
 Differences vs. real PyTorch:
 - Real PyTorch `get_rank()` is a process-global value; here it is
  greenlet-local. Inside a spawned worker → the worker's rank; in the
  main thread → 0. Bench authors should expect meaningful ranks only
  inside worker functions.
 ### D8. Supported API surface (final)
 `DistributedContext` exposes:
 - `init_process_group(backend="ahbm", world_size=None, rank=None,
  **kwargs)`
 - `is_initialized() -> bool`
 - `get_world_size() -> int`
 - `get_rank() -> int`
 - `get_backend() -> str`
 - `all_reduce(tensor, op="sum") -> None`
 - `barrier() -> None`
 - (internal) `_bind_rank(g, rank)`
 Other PyTorch distributed APIs (`broadcast`, `reduce`, `all_gather`,
 `gather`, `scatter`, point-to-point `send/recv`, etc.) are **not
 implemented**. Kernel-level expression is available via
 `tl.send`/`tl.recv` (ADR-0046 D3.10), but the `dist.*` surface does not
 expose them. If additional collectives are needed, add a paired
 (algorithm module, `DistributedContext` method) and extend D8.
 ## Alternatives Considered
 ### A1. Create the backend in `RuntimeContext.__init__`
 Rejected. If `ccl.yaml` is missing or the algorithm module can't be
 imported, RuntimeContext construction would fail even when the bench
 does not use distributed features. Lazy creation at call time (D1) is
 the right semantics.
 ### A2. Always derive world_size from topology (no override)
 Rejected. ADR-0024 D1's "explicit override" path is used by legacy
 tests. Diagnostic scenarios that define PE-level ranks within a single
 SIP also need this escape hatch.
 ### A3. Silent fallback for unsupported `op`
 Rejected. If the user intends `op="prod"` / `"max"` / `"avg"` and `sum`
 silently runs instead, the wrong result is very hard to trace back to
 the fallback. An explicit `NotImplementedError` is safer.
 ### A4. Implement `barrier` as a SimPy event
 Rejected (currently). With single-driver semantics there is no
 cross-process synchronization to express, so a no-op is meaningfully
 correct. A fake-barrier SimPy event would add code complexity for no
 semantic gain. Revisit when multi-process kernbench arrives.
 ## Consequences
 - The five-step installation (D3) for
  `torch.distributed.init_process_group(backend="ahbm")` is locked in,
  making clear where future collective algorithms must hook.
 - The priority order in D2 (algorithm > defaults > topology) makes the
  blast radius of a ccl.yaml change easy to determine.
 - The no-op `barrier` (D6) is recorded so multi-process kernbench, if
  introduced, must explicitly supersede this ADR.
 - D8's list of unsupported APIs explicitly grounds the rejection
  message when users call, e.g., `dist.broadcast(...)`.
@@ -0,0 +1,278 @@
 # ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
 ## Status
 Accepted (2026-05-22).
 Pins down the free-list algorithm, page alignment, and coalescing rules
 used by `policy/address/allocator.py`'s `_FreeList` / `PEMemAllocator`
 and `va_allocator.py`'s `VirtualAllocator`. ADR-0001 (PhysAddr layout)
 and ADR-0011 (PA/VA/LA models) define the address schemes; the
 **allocation algorithms** had no ADR-level coverage.
 ## First action
 ### `_FreeList(capacity)`
 On construction: `self._capacity = capacity`, `self._used = 0`,
 `self._free = [(0, capacity)]`. The first act is **establishing the
 entire region as one free block** — the tuple `(offset=0,
 size=capacity)` is the sole entry in the free list.
 ### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
 On construction, builds two `_FreeList`s:
 - `self._hbm = _FreeList(cfg.hbm_slice_bytes)` — the size of this PE's
  HBM slice (`hbm_bytes_per_cube // hbm_slices_per_cube`).
 - `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)` — equals
  `tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler
  reservation is pre-deducted).
 So PEMemAllocator's first act is **constructing single-free-block
 HBM-slice and TCM regions for this PE**.
 ### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
 On construction: `self._va_base = va_base`, `self._va_size = va_size`,
 `self._page_size = page_size`, `self._used = 0`, `self._free =
 [(va_base, va_size)]`. The first act is **establishing one free block
 that starts at va_base and spans va_size bytes, and stashing page_size**.
 ## Context
 `runtime_api/context.py::_ensure_allocators` builds the allocator set
 in these stages:
 1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`,
   per-target_device SIP range, etc. from `spec`.
 2. Pack everything into a frozen `AddressConfig`.
 3. For every combination in the target SIP range × cubes × PEs,
   construct one `PEMemAllocator(sip, cube, pe, cfg)` instance.
 4. Construct one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64
   GiB, page_size=pe_mmu.page_size)`.
 Allocator responsibilities:
 - **PEMemAllocator**: PA-space allocation in the PE-local HBM slice /
  TCM (including PhysAddr encoding).
 - **VirtualAllocator**: device-wide VA allocation, page-aligned.
  `RuntimeContext._create_tensor` then pushes VA → PA mappings to
  components via `MmuMapMsg`.
 These algorithms are:
 - **First-fit**, kept simple.
 - The free-block list is **sorted by start offset**.
 - On `free()`, **adjacent blocks coalesce**.
 The rationale was not documented anywhere, so when someone asks "why
 not best-fit?", "why not a buddy allocator?", "why does partial-overlap
 free pass silently?", there was no anchor to answer from. This ADR
 provides it.
 ## Decision
 ### D1. `_FreeList` — offset-keyed first-fit + coalescing
 `policy/address/allocator.py::_FreeList`:
 - Internal representation: `list[tuple[int, int]] = [(start_offset,
  size), ...]` — sorted by start offset.
 - `alloc(nbytes)`:
  1. Iterate the free list from the front (first-fit).
  2. Take from the first block with `size >= nbytes`.
  3. Exact match → drop the block; otherwise shrink it to `(start +
     nbytes, size - nbytes)`.
  4. `_used += nbytes`; return the taken `start`.
  5. If no block fits, `AllocationError("overflow ... largest free
     block ...")`.
 - `free(offset, nbytes)`:
  1. `_used -= nbytes`.
  2. `bisect_left(self._free, (offset,))` finds the insertion index.
  3. If adjacent to the previous block (`prev_start + prev_size ==
     offset`), merge.
  4. If adjacent to the next block (`offset + nbytes == next_start`),
     merge.
  5. Insert the coalesced range at the right sorted position.
 This algorithm is weaker than best-fit / buddy on fragmentation, but
 the simulator's workload (mostly stack-like deploy/free) tolerates it.
 If the workload shape changes, D1 is a supersession candidate.
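The alloc/free steps in D1 can be sketched as a compact class. This is an illustrative reimplementation under the assumptions listed in D1 and D2 (offset-sorted tuples, no validation on free), not a copy of `policy/address/allocator.py`; the name `FreeListSketch` is hypothetical.

```python
from bisect import bisect_left


class AllocationError(Exception):
    pass


class FreeListSketch:
    def __init__(self, capacity: int):
        self._capacity = capacity
        self._used = 0
        self._free = [(0, capacity)]  # sorted list of (start_offset, size)

    def alloc(self, nbytes: int) -> int:
        for i, (start, size) in enumerate(self._free):  # first-fit scan
            if size >= nbytes:
                if size == nbytes:
                    del self._free[i]                    # exact match
                else:
                    self._free[i] = (start + nbytes, size - nbytes)
                self._used += nbytes
                return start
        largest = max((s for _, s in self._free), default=0)
        raise AllocationError(
            f"overflow: need {nbytes} B, largest free block {largest} B"
        )

    def free(self, offset: int, nbytes: int) -> None:
        self._used -= nbytes
        i = bisect_left(self._free, (offset,))  # first block at or after offset
        start, size = offset, nbytes
        if i > 0:
            prev_start, prev_size = self._free[i - 1]
            if prev_start + prev_size == start:      # merge with previous
                start, size = prev_start, prev_size + size
                i -= 1
                del self._free[i]
        if i < len(self._free):
            next_start, next_size = self._free[i]
            if start + size == next_start:           # merge with next
                size += next_size
                del self._free[i]
        self._free.insert(i, (start, size))


fl = FreeListSketch(100)
a, b, c = fl.alloc(10), fl.alloc(20), fl.alloc(30)
fl.free(a, 10)
fl.free(c, 30)
fl.free(b, 20)  # bridges both neighbors, restoring one block
```

Freeing in an order that leaves `b` for last exercises both merge branches at once; after the third `free` the list is back to a single `(0, 100)` block.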
 ### D2. Partial-overlap free is **not** validated
 `_FreeList.free(offset, nbytes)` trusts the caller to pass the exact
 `(offset, nbytes)`. It does **not** verify:
 - That the range was actually allocated.
 - That the range does not overlap another allocated region.
 Reason: in a simulator context, callers always store the return value
 of `alloc()` and pass it back to `free()` — there is no external user
 input. Adding a safety check would cost O(N) per free and impact
 simulation wall-clock.
 If this trust model breaks (e.g., a code path lets two tensors point
 at the same PA), this ADR must be revisited.
 ### D3. `PEMemAllocator` — two channels for HBM/TCM
 `PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList`s:
 - `_hbm`: size `cfg.hbm_slice_bytes`.
 - `_tcm`: size `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
  tcm_scheduler_reserved_bytes`).
 `alloc_hbm(nbytes) -> PhysAddr`:
 - `_hbm.alloc(nbytes)` → offset.
 - `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id,
  pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`.
 - Failure raises `AllocationError("HBM overflow ...")`.
 `free_hbm(pa, nbytes)`:
 - Recover PE-local offset via `pa.hbm_offset - pe_id *
  cfg.hbm_slice_bytes`.
 - `_hbm.free(offset, nbytes)`.
 `alloc_tcm(nbytes) -> PhysAddr`: similar; uses `PhysAddr.pe_tcm_addr`.
 `free_tcm(pa, nbytes)`: uses `pa.sub_offset` directly (TCM's PE-local
 offset equals its sub_offset).
 The allocator does not see the scheduler-reserved TCM region
 (`cfg.tcm_scheduler_reserved_bytes`) — it is pre-subtracted from the
 `_tcm` capacity. This is consistent with ADR-0014's PE_SCHEDULER
 internal-buffer reservation.
 ### D4. `VirtualAllocator` — page-aligned first-fit + coalescing
 `policy/address/va_allocator.py::VirtualAllocator`:
 - Internal representation: same sorted `list[tuple[int, int]]` as
  `_FreeList`. Initially `[(va_base, va_size)]`.
 - `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
 - `alloc(nbytes) -> int`:
  1. `aligned = _align_up(nbytes)`.
  2. First-fit a block with `size >= aligned`.
  3. Take `aligned` from the block's front; remove if exact.
  4. `_used += aligned`. Return the block's `start` (which is page-
     aligned).
  5. Failure → `VaAllocationError`.
 - `free(va, nbytes)`: free `_align_up(nbytes)` worth. Coalesces with
  the same algorithm as `_FreeList`.
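To make the page rounding concrete, here is a small sketch of `_align_up` and its effect on consecutive allocations. The helper `take_from_front` is hypothetical and models only "take `aligned` bytes from the block's front"; the free-list bookkeeping is omitted.

```python
PAGE_4K = 4096
PAGE_2M = 2 * 1024 * 1024


def align_up(nbytes: int, page_size: int) -> int:
    """Round nbytes up to a whole number of pages."""
    return -(-nbytes // page_size) * page_size


def take_from_front(block_start: int, nbytes: int, page_size: int):
    """Return (allocated_va, start_of_remaining_block)."""
    aligned = align_up(nbytes, page_size)
    return block_start, block_start + aligned


va_base = 0x1_0000_0000
va0, remaining = take_from_front(va_base, 5000, PAGE_4K)  # 5000 B uses 2 pages
va1, _ = take_from_front(remaining, 1, PAGE_4K)           # 1 B still uses 1 page
```

Because every allocation consumes a whole number of pages and `va_base` is page-aligned, each returned address stays page-aligned without any extra padding step.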
 `page_size` has different defaults in two places:
 - `VirtualAllocator.__init__`'s parameter default: `2 MiB`. Direct-call
  tests receive this.
 - `RuntimeContext._ensure_allocators` when constructing the instance:
  `pe_mmu.attrs.get("page_size", 4096)` — uses
  `topology.yaml`'s `pe_mmu.attrs.page_size` if set, else falls back
  to `4 KiB`.
 The two defaults differ on purpose: `VirtualAllocator`'s standalone
 default (`2 MiB`) aligns with ADR-0039's PE_MMU stopgap default for
 direct-test ergonomics; the context fallback (`4 KiB`) is the safe
 minimum when `topology.yaml` doesn't specify a page size. The
 production path is always the latter (via `_ensure_allocators`), and
 when `topology.yaml` sets `page_size`, that value flows consistently
 into both the MMU and the VA allocator.
 If consistency breaks (e.g., VirtualAllocator instantiated with a
 page_size different from PE_MMU's), MMU `map()` falls into the
 sub-page region mode (ADR-0039 D3).
 VA range defaults: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64
 GiB`. These are hardcoded in `_ensure_allocators` and have no
 semantic meaning in ADR-0011's VA model — they simply reserve enough
 device-wide space without colliding with host code.
 ### D5. Lifecycle of allocator instances
 - `RuntimeContext._ensure_allocators` is lazy — called on the first
  `_create_tensor`.
 - The allocator dict (`self._allocators`) lives for the
  RuntimeContext's lifetime. A second deploy in the same process
  does not construct new objects.
 - `RuntimeContext.cleanup()` walks living tensors and calls
  `_free_tensor()`, which issues MMU unmaps + `va_allocator.free` +
  `pemem_allocator.free_hbm` — restoring the free lists. A subsequent
  RuntimeContext starts fresh.
 This per-RuntimeContext isolation guarantees deterministic deploy →
 cleanup → deploy sequences within a single process.
 ### D6. Allocator failure raises (no silent OOM)
 Both `_FreeList.alloc` and `VirtualAllocator.alloc` raise
 `AllocationError` / `VaAllocationError` when no block fits. The message
 includes "required size + largest available block" to distinguish
 fragmentation from true OOM.
 A silent fallback (e.g., allocating only as much as the largest free
 block) is strictly forbidden — a partially-allocated tensor reaching
 SimPy would cause routing / DMA to see incorrect PAs and silently
 corrupt simulation results.
 ### D7. One allocator per address space
 Physical address spaces are separated by PhysAddr sub-units (ADR-0001
 D2.3); each sub-unit gets its own allocator instance:
 - HBM slice → `PEMemAllocator._hbm`.
 - PE TCM → `PEMemAllocator._tcm`.
 - (Currently unused) M_CPU local memory, CUBE SRAM → would need their
  own allocators. Today these are handled as IPCQ-only slots (ADR-0023
  D9.7) and do not share PA space, so no free-list exists for them.
 When a cube-level SRAM allocator is needed,
 `_FreeList(cfg.sram_bytes_per_cube)` is added per-cube
 (`cfg.sram_bytes_per_cube` is already defined in `AddressConfig` —
 the data model is ready).
 ## Alternatives Considered
 ### A1. Best-fit / buddy allocator
 Rejected (currently). The workload's alloc/free pattern is stack-like
 (deploy order ≈ free order), so first-fit + coalescing controls
 fragmentation well enough. If long-running fragmentation appears in LLM
 kernel sweeps, a buddy-allocator ADR will replace D1.
 ### A2. Add partial-overlap free validation
 Rejected. D2's trust model plus the O(N) per-free cost makes this
 unattractive. A debug mode (e.g., `KERNBENCH_DEBUG` env var) that
 enables the check could be added later.
 ### A3. A unified allocator for VA and PA
 Rejected. VA space (64 GiB device-wide) and PA space (per-slice ~6
 GiB) have different semantic dimensions — VA is the kernel's view, PA
 is the device sub-unit's view. ADR-0011's VA model (MMU maps between
 the two) calls for separated allocators.
 ### A4. Multi-tier page sizes (large pages + small pages)
 Rejected (currently). A single page size (2 MiB) matches LLM kernel
 tensor sizes (a few MiB to GiB); smaller mappings are absorbed by
 ADR-0039 D3's sub-page region mode. Multi-tier paging would require
 extending the MMU model itself — a separate ADR candidate.
 ## Consequences
 - The allocator algorithm is pinned at ADR level (D1, D3, D4), so any
  future simulation scenario hitting fragmentation has a clear "we're
  using first-fit + coalescing" anchor to inspect.
 - D2's trust model is explicit, so any future code path that exposes
  alloc/free to direct user input will trigger this ADR's supersession
  early.
 - D7's one-allocator-per-sub-unit mapping is on record, so when M_CPU
  or SRAM need their own free-list, the addition point is obvious.
 - D4 captures the page_size dual-default and its production path
  (`_ensure_allocators` always wins), letting future `topology.yaml`
  `page_size` changes be assessed against ADR-0039's stopgap
  interaction quickly.
@@ -0,0 +1,247 @@
 # ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
 ## Status
 Accepted (2026-05-22).
 Pins down the traffic-pattern catalog, formula-vs-actual comparison, and
 invariant checks (monotonicity, D2H ≥ H2D, etc.) exposed by
 `probes/probe.py::run_probe(...)`. ADR-0010 (CLI surface) enumerates the
 `kernbench probe` subcommand, but **what probe actually measures** and
 **which invariants it judges PASS/FAIL** had no ADR-level coverage.
 ## First action
 `run_probe(topology_path, case_filter=None)` performs four startup steps:
 1. `Path(topology_path).expanduser().resolve()` → absolute path.
 2. `load_topology(path)` → `TopologyGraph` (graph + spec).
 3. `_build_edge_map(graph)` → a `{(src, dst): Edge}` lookup table.
 4. Instantiate `AddressResolver(graph)` + `PathRouter(graph)`.
 Then it sets `nbytes = 32768` (= 32 KiB, the summary-table reference
 size) and `show_all = (case_filter is None or case_filter == "all")`.
 In short, **probe's first act is "load the topology once and prepare
 edge map / resolver / router, plus pin 32 KiB as the standard measurement
 size"**. After that, the H2D → D2H → PE DMA categories execute in
 separate `GraphEngine` instances (no cross-talk between cases).
 ## Context
 `kernbench probe` was introduced as a verification tool for these
 purposes:
 - **Manual ground truth**: when a real-simulation result (`kernbench run
  --bench ...`) shows abnormal latency, derive the answer for a simple
  traffic pattern in isolation and compare.
 - **Formula vs actual**: check whether the analytical model
  (wire latency + overhead + drain) matches the simulator's
  `total_ns`. A mismatch points to which simplifying assumption in
  ADR-0033 is missing.
 - **Monotonicity check**: latency should grow monotonically with hop
  count.
 - **Utilization sweep**: a BW-utilization table across data sizes
  (4 KiB to 1 MiB).
 Without an ADR for this tool:
 - Adding a new traffic-pattern category (e.g., MCpuDma, IPCQ) is hard
  because the table format / measurement units of existing categories
  aren't documented at the ADR level.
 - The basis for the monotonicity check (hop count? cube distance? wire
  length?) is ambiguous.
 - The reference size 32 KiB and the sweep `[4 KiB, 16 KiB, 64 KiB, 256
  KiB, 1 MiB]` are only discoverable by reading source.
 ## Decision
 ### D1. Three case categories — H2D / D2H / PE DMA
 Each category has a distinct data path in the topology and gets its own
 summary table + sweep table + route-detail block.
 - **H2D (Host → Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
  dst_pe=0, pattern="zero")` flows along `pcie_ep → io_cpu → m_cpu →
  hbm_ctrl`. The cube index varies the hop count:
  - h2d-1hop: cube=0, hops=1
  - h2d-2hop: cube=4, hops=2
  - h2d-3hop: cube=8, hops=3
  - h2d-4hop: cube=12, hops=4
 - **D2H (Device → Host Read)**: `MemoryReadMsg(src_sip=0, src_cube,
  src_pe=0)`. Total latency = forward command path + reverse data path.
  Runs the same four hop-count cases as H2D.
 - **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe,
  dst_pa)`. Five cases cover varying cube/PE positions:
  - pe-local-hbm: same cube, same PE
  - pe-same-half-hbm: same cube, different PE (PE 1)
  - pe-cross-half-hbm: same cube, far PE (PE 4)
  - pe-cross-cube-hbm-best: adjacent cube (cube 1)
  - pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
 The cube indices 0/4/8/12 (H2D, D2H) and 1/15 (PE DMA cross-cube cases)
 are meaningful for a 4 × 4 cube mesh (`sip.cube_mesh.w=4, h=4`); changes
 to the mesh size require these to be updated in lockstep.
 ### D2. Standard measurement size — `nbytes = 32768` (32 KiB)
 Every case in the summary table runs once with `nbytes=32768`. 32 KiB
 was chosen because:
 - DMA overhead and BW drain are balanced — neither dominates.
 - It compares cleanly against the one-shot transfer size of several
  sub-units (TCM, register file).
 Per-size utilization variations are shown in a separate sweep table
 (D3).
 ### D3. Utilization sweep — `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
 `SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`,
 `SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]`. Per size:
 ```
 drain   = nbytes / bottleneck_bw
 total   = overhead + wire + drain
 eff_bw  = nbytes / total
 util%   = eff_bw / bottleneck_bw × 100
 ```
 When `bn_bw` is `None` or `<= 0`, the column shows 0.0 %. The intent: the
 table shows in one view how small transfers become overhead-bound and
 large transfers become drain-bound as hop count rises.
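The per-size formula can be evaluated with a few lines of Python. This is a sketch with made-up inputs: the overhead, wire, and bandwidth figures below are hypothetical, and the code assumes bandwidth in GB/s with 1 GB/s treated as 1 byte per ns so that `drain` comes out in nanoseconds.

```python
SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]
SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]


def util_percent(nbytes, overhead_ns, wire_ns, bn_bw):
    """Utilization of the bottleneck bandwidth for one transfer size."""
    if bn_bw is None or bn_bw <= 0:
        return 0.0
    drain = nbytes / bn_bw              # ns, assuming 1 GB/s == 1 B/ns
    total = overhead_ns + wire_ns + drain
    eff_bw = nbytes / total
    return eff_bw / bn_bw * 100.0


# Hypothetical path: 100 ns overhead, 2 ns wire, 64 GB/s bottleneck.
rows = [
    (label, util_percent(size, 100.0, 2.0, 64.0))
    for label, size in zip(SWEEP_LABELS, SWEEP_SIZES)
]
for label, util in rows:
    print(f"{label:>6}  {util:5.1f} %")
```

With a fixed overhead, utilization climbs toward 100 % as the drain term grows, which is the overhead-bound to drain-bound transition the table is meant to expose.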
 ### D4. Measured columns — actual / formula / breakdown
 Per-case columns:
 - `Actual` (total_ns): the SimPy run's `trace["total_ns"]`.
 - `Ovhd`: sum of `node.attrs["overhead_ns"]` along the path (formula).
 - `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
 - `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
 - `Ovhd%` / `Drain%`: each portion as a percentage of Actual. Wire is
  usually too small to display.
 - `Eff.BW`: `nbytes / total_ns` (measured BW).
 - `BN.BW`: bottleneck bandwidth (formula). The minimum edge BW along
  the path. Missing edge BW shows "-".
 - `Util%`: `Eff.BW / BN.BW × 100`. 100 % means the single-stream BW
  upper bound is reached.
 A large gap between the formula sum (`wire + ovhd + drain`) and Actual
 signals a factor the simplified model misses (a place to inspect
 ADR-0033's assumptions).
 ### D5. Automatic invariant checks — PASS/FAIL
 The following invariants are reported with `[v] PASS` / `[x] FAIL`:
 - **H2D / D2H monotonic increase**: as hop count rises, actual latency
  must grow monotonically. `all(lats[i] < lats[i+1] for ...)`.
 - **D2H ≥ H2D**: for the same hop index, D2H ≥ H2D (D2H has both
  forward command and reverse data legs). `all(d2h[i].total >=
  h2d[i].total for ...)`.
 - **PE DMA best < worst**: cross-cube best (adjacent) latency must be
  less than cross-cube worst (diagonal).
 - **PE DMA local vs remote**: prints the local BN BW vs remote BN BW
  side-by-side (informational, not PASS/FAIL).
 When a check fails, a single clear line surfaces the regression for
 human review.
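The two PASS/FAIL comparisons above can be sketched as follows. The helper names and latency values are hypothetical; only the `[v] PASS` / `[x] FAIL` prefixes and the comparison shapes come from this ADR.

```python
def strictly_increasing(lats):
    """True when every latency is strictly smaller than the next one."""
    return all(lats[i] < lats[i + 1] for i in range(len(lats) - 1))


def d2h_not_faster(h2d, d2h):
    """True when D2H >= H2D at every shared hop index."""
    return all(d >= h for h, d in zip(h2d, d2h))


def report_line(label, ok):
    """Format one invariant result with the [v] / [x] prefix."""
    return f"{'[v] PASS' if ok else '[x] FAIL'}  {label}"


# Hypothetical total_ns values, one per hop category.
h2d_ns = [310.0, 420.0, 535.0, 650.0]
d2h_ns = [330.0, 455.0, 580.0, 700.0]

print(report_line("H2D monotonic increase", strictly_increasing(h2d_ns)))
print(report_line("D2H >= H2D", d2h_not_faster(h2d_ns, d2h_ns)))
```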
 ### D6. Route detail — per-hop timestamp trace
 After the summary and sweep tables, each case's path and cumulative
 per-hop timestamps (`_hop_timestamps`) appear in a separate section:
 - H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
  (`m_cpu → hbm_ctrl`) + per-hop trace.
 - D2H: forward (cmd, no data) and reverse (data) traces shown
  separately.
 - PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
 Each hop's timestamp is cumulative `wire_ns + overhead_ns`. The
 terminal hop's annotation appends `drain:Xns`. Bottleneck edges are
 marked `<BN:XXGB/s>` so they are visually identifiable.
 ### D7. Semantics of the `case_filter` argument
 - `None` or `"all"`: run all cases (default).
 - Other strings: run only the case whose name matches exactly. Example:
  `kernbench probe --case h2d-2hop`.
 Within a category, cases with `name != case_filter` are skipped; if
 only one data point remains, the category's monotonicity / D2H ≥ H2D
 comparisons are naturally skipped.
 The CLI parser's `--case` default is `"all"`, so omitting it runs
 everything.
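The filter semantics can be sketched as a small selection helper. `select_cases` is a hypothetical name; the case names are the PE DMA cases listed in D1.

```python
PE_DMA_CASES = [
    "pe-local-hbm",
    "pe-same-half-hbm",
    "pe-cross-half-hbm",
    "pe-cross-cube-hbm-best",
    "pe-cross-cube-hbm-worst",
]


def select_cases(names, case_filter=None):
    """Keep every case for None/"all", otherwise only the exact name match."""
    if case_filter is None or case_filter == "all":
        return list(names)
    return [name for name in names if name == case_filter]


print(select_cases(PE_DMA_CASES, "pe-local-hbm"))
```

Matching is exact, so a prefix such as `"pe-cross"` selects nothing in this sketch.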
 ### D8. Fresh GraphEngine per case
 Each of the 4 H2D, 4 D2H, and 5 PE DMA cases runs in **its own
 GraphEngine** (`engine = GraphEngine(graph)`). Reasons:
 - Isolate accumulated state (op_log, completion tracking, allocators)
  so cases do not cross-talk.
 - Guarantee one case's traffic does not perturb another case's BW
  measurement.
 This isolation lets probe results be interpreted as **single-flow**
 per-case latency. Multi-flow contention measurement is handled by
 separate tooling (e.g., the `pe2pe_overview` plot or ADR-0033's
 multi-flow merging model).
 ### D9. Output-format stability
 probe's stdout is meant for humans; precise column widths, separators,
 and whitespace are **not** a machine-readable contract. Automated tools
 that wish to parse probe output should use a separate JSON-output mode
 (not yet implemented).
 The `[v]` / `[x]` prefix on PASS/FAIL lines is a stable CI grep anchor.
 ## Alternatives Considered
 ### A1. Register probe as another bench (`@bench(name="probe")`)
 Rejected. probe is a verification tool, not a bench: multi-engine
 execution for sweeps/analysis and PASS/FAIL invariant output are
 essential, and neither fits ADR-0045's "single device + single
 RuntimeContext" bench model.
 ### A2. Exit code 1 on monotonicity violation
 Rejected (currently). probe is positioned as a human inspection tool —
 PASS/FAIL is printed and exit is 0. A wrapper can `grep "\[x\]"` to
 decide. A future `--strict` flag could opt into non-zero exits.
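A wrapper along these lines could turn the printed anchors into an exit code. This is a hypothetical sketch: `strict_exit_code` and the sample output text are illustrative, and only the `[x]` anchor is taken from this ADR.

```python
def strict_exit_code(probe_stdout):
    """Return 1 if any output line carries the [x] anchor, else 0."""
    return 1 if any("[x]" in line for line in probe_stdout.splitlines()) else 0


# Made-up output fragments, shaped like the PASS/FAIL lines described in D5.
clean = "[v] PASS  H2D monotonic increase\n[v] PASS  D2H >= H2D\n"
broken = "[v] PASS  H2D monotonic increase\n[x] FAIL  D2H >= H2D\n"

print(strict_exit_code(clean), strict_exit_code(broken))
```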
 ### A3. Externalize the case catalog to YAML
 Rejected (currently). The 13 cases (4 H2D + 4 D2H + 5 PE DMA)
 are hardcoded and their semantics are tightly bound to the mesh
 topology. Moving cube-index meaning (4, 8, 12 / 1, 4, 15) into YAML
 would require separate documentation and lose cohesion. Externalize
 only when case additions become frequent.
 ### A4. Add multi-flow contention measurement
 Rejected (out of probe scope). D8's single-flow isolation is probe's
 core intent. Multi-flow contention belongs in a different area of the
 ADR-0033 latency model — either a separate tool or a new case
 category.
 ## Consequences
 - probe's case catalog (D1) and measurement units (D2/D3) are pinned at
  ADR level, so new traffic categories know which table format to
  follow.
 - The semantics of the formula-vs-actual columns (D4) are locked in, so
  questions like "why is Drain% 5 % or 70 %?" can quickly be linked to
  ADR-0033 assumption checks.
 - Automatic invariant checks (D5) are pinned, so latency-model changes
  immediately catch monotonicity / D2H ≥ H2D regressions.
 - D8's case-isolation is explicit, so probe results are safe to read as
  single-flow measurements. If multi-flow is needed, a separate tool
  track is clearly required.
 - A2's strict-mode flag is recorded as a follow-up so CI integration
  has a minimal change path when requested.
@@ -0,0 +1,322 @@
 # ADR-0050: CCL Algorithm Module Contract — `ccl/algorithms/*.py`
 ## Status
 Accepted (2026-05-22).
 Pins down the interface, kernel signature, and addition workflow that a
 module under `src/kernbench/ccl/algorithms/` must satisfy in order to be
 used as a collective algorithm by the AHBM CCL backend (ADR-0047).
 ADR-0047 D3 states only that "the algorithm module must expose `kernel`,
 `kernel_args`, optionally `TOPO_NAME_TO_KIND`"; **the contract an
 algorithm-module author needs to follow** has had no ADR-level coverage.
 This ADR pairs with ADR-0045's bench-module contract.
 ## First action
 An algorithm module is imported at two moments:
 1. **AHBM backend entry**: when user code calls
   `dist.init_process_group(backend="ahbm")`,
   `AhbmCCLBackend.__init__` runs
   `self._algo_module = importlib.import_module(self._merged["module"])`.
   At module level, the following occur first:
   - Topology-kind integer constants like `SIP_TOPO_RING/TORUS/MESH`
     are bound in the module namespace.
   - The `TOPO_NAME_TO_KIND` dict is bound; the backend reads it via
     `getattr(self._algo_module, "TOPO_NAME_TO_KIND", None)`.
   - `kernel_args` function is defined for the caller.
   - The actual algorithm function (e.g.,
     `allreduce_intercube_multidevice`) is defined.
   - At the bottom of the module, `kernel = allreduce_intercube_multidevice`
     publishes the alias.
 2. **ccl.yaml install stage**:
   `kernbench.ccl.install.install_ipcq` imports the same algorithm
   module while pushing the IPCQ neighbor table.
 In short, **the algorithm module's first act is "publish topology-kind
 constants, the `TOPO_NAME_TO_KIND` dict, the `kernel_args` function, and
 the `kernel` alias into the module namespace"** — all as import-time
 side effects, no separate initialization call.
 ## Context
 `AhbmCCLBackend` (ADR-0047), at process-group creation, dynamically
 imports a module path obtained from `ccl.yaml`'s `defaults.algorithm` (or
 a user-specified algorithm). The backend expects four things from the
 module:
 - `kernel`: the collective's entry function.
 - `kernel_args(world_size, n_elem, cube_w=, cube_h=) -> tuple`: a tuple
  packing the kernel's positional arguments.
 - `TOPO_NAME_TO_KIND` (optional): a dict mapping `topology.yaml`'s
  `sips.topology` string (e.g., `"ring_1d"`, `"torus_2d"`,
  `"mesh_2d_no_wrap"`) to the integer kind constants.
 - (Indirectly) IPCQ neighbor-table install:
  `configure_sfr_intercube_multisip` reads
  the module's `TOPO_NAME_TO_KIND` plus cube dimensions to decide the
  SFR.
 The current corpus has one algorithm module:
 `lrab_hierarchical_allreduce.py` (248 lines). The name expands to
 "**l**eft-**r**ight **a**lternating **b**roadcast hierarchical allreduce".
 When future modules like `ring_allreduce`, `tree_allreduce`, or
 `broadcast` are added, they must follow this contract for the backend's
 dispatch path to keep working.
 Without an ADR-level contract:
 - A new algorithm author has to infer the signature from ADR-0047 D3's
  one-liner.
 - The kernel-function argument order (especially `t_ptr, n_elem,
  cube_w, cube_h, n_sips, sip_rank, sip_topo_kind, sip_topo_w,
  sip_topo_h, tl`) is unclear without grep.
 - What `kernel_args` takes as inputs and what tuple it must return is
  a matter of convention, not documented anywhere.
 ## Decision
 ### D1. The algorithm module exposes four public symbols
 ```python
 # src/kernbench/ccl/algorithms/<name>.py
 from __future__ import annotations
 # (required) topology-kind constants — referenced internally
 SIP_TOPO_RING = 0
 SIP_TOPO_TORUS = 1
 SIP_TOPO_MESH = 2
 # (optional) topology name → kind mapping. Used by the backend to
 # translate ccl.yaml/topology's string SIP topology into an integer.
 TOPO_NAME_TO_KIND = {
    "ring_1d": SIP_TOPO_RING,
    "torus_2d": SIP_TOPO_TORUS,
    "mesh_2d_no_wrap": SIP_TOPO_MESH,
 }
 # (required) kernel argument builder
 def kernel_args(world_size: int, n_elem: int, *, cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)
 # (required) kernel function (TLContext is injected via the `tl=...`
 # keyword argument).
 def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                         sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    ...
 # (required) kernel alias — the backend accesses `module.kernel`
 kernel = my_allreduce_kernel
 ```
 - The `kernel` alias is the entry point the backend invokes. Whatever
  the function name is (e.g., `allreduce_intercube_multidevice`), it
  must be exposed via `module.kernel = fn`.
 - Without `kernel_args`, the backend has no way to build the
  algorithm's argument list. See D2 for the signature.
 - If `TOPO_NAME_TO_KIND` is absent, the backend falls back to
  `sip_topo_kind = 0`. An algorithm supporting only a single topology
  may omit it.
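The optional-dict fallback can be sketched as below. `lookup_topo_kind` and the `SimpleNamespace` stand-ins for imported modules are hypothetical; the `getattr(..., "TOPO_NAME_TO_KIND", None)` read and the fallback to `0` are the parts described in this ADR. How the backend treats a name that is missing from an existing dict is not stated here, so the `KeyError` below is an assumption.

```python
from types import SimpleNamespace


def lookup_topo_kind(algo_module, topo_name):
    """Translate a topology name to an integer kind, defaulting to 0."""
    table = getattr(algo_module, "TOPO_NAME_TO_KIND", None)
    if table is None:
        return 0  # documented fallback when the module omits the dict
    # Assumption: an unknown name raises KeyError instead of defaulting.
    return table[topo_name]


# Stand-ins for two imported algorithm modules.
multi_topo = SimpleNamespace(
    TOPO_NAME_TO_KIND={"ring_1d": 0, "torus_2d": 1, "mesh_2d_no_wrap": 2}
)
single_topo = SimpleNamespace()  # no TOPO_NAME_TO_KIND attribute

print(lookup_topo_kind(multi_topo, "torus_2d"), lookup_topo_kind(single_topo, "torus_2d"))
```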
 ### D2. `kernel_args` signature — `(world_size, n_elem, *, cube_w, cube_h)`
 ```python
 def kernel_args(world_size: int, n_elem: int, *,
                cube_w: int = 4, cube_h: int = 4) -> tuple:
    return (n_elem, cube_w, cube_h, world_size)
 ```
 - **Positional arguments**: `world_size` (= number of ranks), `n_elem`
  (= element count of a single shard, f16-based).
 - **Keyword arguments**: `cube_w`, `cube_h` (= cube-mesh dimensions).
  Default 4×4 — aligned with `topology.yaml`'s `sip.cube_mesh` default.
 - **Return**: a tuple in the order the kernel's positional arguments
  expect.
 When the backend calls `all_reduce`:
 ```python
 kernel_args_tuple = self._algo_module.kernel_args(
    self._world_size, n_elem, cube_w=eff_cube_w, cube_h=eff_cube_h,
 )
 extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)
 pending = self.ctx.launch(
    self._merged["algorithm"], kernel_fn, tensor,
    *kernel_args_tuple, *extra_args, _defer_wait=True,
 )
 ```
 So the kernel's full positional argument list becomes: `(tensor_ptr,
 *kernel_args_tuple, sip_rank, sip_topo_kind, sip_topo_w,
 sip_topo_h)`, with `tl=...` injected as a keyword. The tuple length
 and order returned by `kernel_args` must **match the kernel signature
 1:1**.
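An author can check the 1:1 match before wiring a module into the backend. The sketch below is not part of the backend: `check_arity` is a hypothetical helper that counts the kernel's positional parameters with `inspect.signature` and compares them against the tensor pointer, the `kernel_args` tuple, and the four appended `sip_*` arguments.

```python
import inspect


def kernel_args(world_size, n_elem, *, cube_w=4, cube_h=4):
    return (n_elem, cube_w, cube_h, world_size)


def my_allreduce_kernel(t_ptr, n_elem, cube_w, cube_h, n_sips,
                        sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h, *, tl):
    return None


def check_arity(kernel_fn, args_tuple, n_extra=4):
    """True when tensor_ptr + args_tuple + extra args fill every positional slot."""
    positional_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    params = inspect.signature(kernel_fn).parameters.values()
    n_positional = sum(p.kind in positional_kinds for p in params)
    return n_positional == 1 + len(args_tuple) + n_extra


print(check_arity(my_allreduce_kernel, kernel_args(2, 8)))
```

The keyword-only `tl` parameter is deliberately excluded from the count, since it is injected by keyword.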
 ### D3. Kernel signature — standardized 9 + tl arguments
 Recommended signature:
 ```python
 def my_kernel(
    t_ptr: int,             # VA base of the row-wise-sharded tensor on this SIP
    n_elem: int,            # element count per cube tile (or per shard)
    cube_w: int,            # cube mesh width  (from kernel_args)
    cube_h: int,            # cube mesh height (from kernel_args)
    n_sips: int,            # equal to world_size (rank = SIP, ADR-0024)
    sip_rank: int,          # this SIP's rank
    sip_topo_kind: int,     # result of TOPO_NAME_TO_KIND lookup
    sip_topo_w: int,        # SIP mesh width (0 for ring_1d)
    sip_topo_h: int,        # SIP mesh height (0 for ring_1d)
    *, tl,                  # TLContext (auto-injected)
 ) -> None:
 ```
 Even if `kernel_args` chose a different positional argument order, the
 kernel's **last four positional arguments are always
 `(sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)`** — the backend
 appends them as `extra_args` (ADR-0047 D5). A custom algorithm must
 accept these four, but a single-SIP algorithm may simply ignore them.
 `tl` is injected via keyword — `RuntimeContext.launch` adds `tl=tl_ctx`
 just before invoking the kernel. The signature therefore exposes `tl`
 as keyword-only (`*, tl`) or as the trailing keyword parameter.
 ### D4. Kernel body — freedom and constraints
 Available inside the kernel: every `tl.*` primitive from ADR-0046 D3.
 Common patterns:
 - `cube_id = tl.program_id(axis=1)` — this PE's cube index.
 - `pe_addr = t_ptr + cube_id * nbytes` — per-cube VA of the tile.
 - `acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")` — load local
  data.
 - `tl.send(dir=...)` / `tl.recv(dir=..., shape=, dtype=)` — IPCQ
  collective.
 - `acc = acc + recv` — TensorHandle arithmetic operators (ADR-0046 D4).
 - `tl.store(pe_addr, acc)` — store the result.
 The kernel body is plain Python — branching and loops are fine. But:
 - No SimPy `yield` or `async` (ADR-0046 D1).
 - No direct access to TensorHandle `.data` — the Phase 1 timing model
  doesn't see data dependencies (ADR-0020's 2-pass separation).
 - Kernel execution must be deterministic — the same input must produce
  the same op sequence. No random or external IO.
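The determinism constraint can be illustrated with a recording stand-in for `tl`. Everything here is a hypothetical sketch: `RecordingTL` is not the real TLContext, `copy_kernel` is a toy kernel, and the 2-bytes-per-element figure assumes f16 data. Only the `tl.program_id`, `tl.load`, and `tl.store` call shapes follow the patterns listed above.

```python
class RecordingTL:
    """Hypothetical stand-in for TLContext; it only records the op sequence."""

    def __init__(self, cube_id):
        self._cube_id = cube_id
        self.ops = []

    def program_id(self, axis):
        return self._cube_id if axis == 1 else 0

    def load(self, addr, shape, dtype):
        self.ops.append(("load", addr, shape, dtype))
        return ("tensor", addr)  # opaque handle; the kernel never inspects it

    def store(self, addr, value):
        self.ops.append(("store", addr))


def copy_kernel(t_ptr, n_elem, *, tl):
    cube_id = tl.program_id(axis=1)
    nbytes = n_elem * 2  # assumption: f16, 2 bytes per element
    pe_addr = t_ptr + cube_id * nbytes
    acc = tl.load(pe_addr, shape=(n_elem,), dtype="f16")
    tl.store(pe_addr, acc)


first, second = RecordingTL(cube_id=3), RecordingTL(cube_id=3)
copy_kernel(0x1000, 8, tl=first)
copy_kernel(0x1000, 8, tl=second)
print(first.ops == second.ops)
```

Two runs with identical inputs record identical op lists, which is the property the constraint asks for.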
 ### D5. SIP topology semantics — meaning of `sip_topo_kind`
 The backend looks up `topology.yaml`'s `system.sips.topology` string
 in the algorithm module's `TOPO_NAME_TO_KIND` and passes the integer
 as `sip_topo_kind`. The algorithm then branches:
 ```python
 if sip_topo_kind == SIP_TOPO_RING:
    acc = _inter_sip_ring(...)
 elif sip_topo_kind == SIP_TOPO_TORUS:
    acc = _inter_sip_torus_2d(...)
 elif sip_topo_kind == SIP_TOPO_MESH:
    acc = _inter_sip_mesh_2d(...)
 ```
 Each topology branch communicates with peers via IPCQ direction names
 (`"global_E"`, `"W"`, `"S"`, `"N"` …). Direction semantics are defined
 in ADR-0023/0025; `configure_sfr_intercube_multisip` installs the IPCQ
 neighbor table accordingly.
 If a topology kind not supported by the algorithm appears, prefer an
 explicit `raise ValueError(f"unsupported topology kind
 {sip_topo_kind}")` over a silent no-op — fail fast on misconfiguration.
 ### D6. The `ccl.yaml` algorithm entry
 The algorithm module is paired with a `ccl.yaml` entry (ADR-0023 D10 +
 ADR-0047 D3):
 ```yaml
 defaults:
  algorithm: lrab_hierarchical_allreduce
  n_elem: 8
 algorithms:
  lrab_hierarchical_allreduce:
    module: kernbench.ccl.algorithms.lrab_hierarchical_allreduce
    # optional: world_size override
    # optional: per-algorithm parameters consumed by configure_sfr_intercube_multisip
 ```
 - `module`: the full Python module path; `importlib.import_module`
  consumes this string as-is.
 - `world_size` (optional): when set, overrides the topology fallback
  (ADR-0047 D2).
 - Algorithm-specific parameters are consumed by
  `configure_sfr_intercube_multisip`.
 Workflow to add a new algorithm:
 1. Write `src/kernbench/ccl/algorithms/<name>.py` following D1.
 2. Add the entry under `algorithms` in `ccl.yaml`.
 3. (If needed) extend `kernbench.ccl.sfr_config` with the SFR-install
   branch.
 4. Add tests (e.g., `tests/sccl/test_<name>.py`, extending the
   ADR-0043 eval harness).
 ### D7. Legacy "rank = flat PE index" mode
 The `world_size` override in `ccl.yaml`, surfaced by ADR-0047 D2, is
 used by legacy "rank = flat PE index" tests. The algorithm module can
 assume `n_sips=world_size` ranks even in this mode — the backend
 maintains the rank↔(SIP, cube, PE) mapping, so no modal branching is
 needed inside the algorithm body.
 In single-cube workloads (where `cube_w=cube_h=1`), the algorithm must
 skip mesh-based phases — see the
 `single_cube = (cube_w == 1 and cube_h == 1)` pattern in
 `lrab_hierarchical_allreduce.py`.
 ## Alternatives Considered
 ### A1. Organize the algorithm module as a class (`class Allreduce: kernel(...)`)
 Rejected. The Python module namespace already identifies an algorithm
 (see ADR-0047 D3's `importlib.import_module`). A class wrapper adds
 indirection without simplifying dispatch. Module-level free functions
 plus a `kernel` alias are clean and obvious.
 ### A2. Type `kernel_args` with an explicit dataclass
 Rejected (currently). Each algorithm normally has a different argument
 count; forcing one dataclass would hurt cross-algorithm interchange.
 The tuple return is simple and unpacks cleanly with the backend's
 `*kernel_args_tuple`. If an algorithm wants stronger internal typing,
 it may define its own NamedTuple.
 ### A3. Move SFR installation inside the algorithm module
 Rejected. SFR installation
 (`configure_sfr_intercube_multisip`) is a cross-module decision
 combining topology + algorithm; `kernbench.ccl.sfr_config` is a more
 natural home than the algorithm module itself. D6's "extend
 sfr_config if needed" workflow keeps responsibility boundaries clear.
 ### A4. Auto-register algorithm names via a decorator (analogous to ADR-0045's `@bench`)
 Rejected. Unlike benches, algorithms are already tied to `ccl.yaml`
 entries; an additional registry would be redundant. The string mapping
 in `module` is sufficient.
 ## Consequences
 - ADR-0047 D3's one-line contract expands to a D1–D7 author-facing
  guide; new algorithm signatures no longer need to be grep-derived.
 - D3's standardized 9 + tl signature couples naturally with the
  backend's `extra_args` append (ADR-0047 D5). It is explicit that
  even single-SIP-only algorithms must accept the four `sip_*` trailing
  arguments.
 - D5's fail-loud recommendation means a `ccl.yaml` topology that the
  algorithm doesn't support will surface as an explicit `ValueError`
  rather than a silent wrong result.
 - D6's step-by-step addition workflow makes clear how far a new
  algorithm has to reach into sfr_config / tests / ccl.yaml.
@@ -0,0 +1,288 @@
 # ADR-0051: Routing Helper API — `AddressResolver` + `PathRouter`
 ## Status
 Accepted (2026-05-22).
 Pins down every public API, argument, return value, and adjacency-graph
 selection of the two helper classes (`AddressResolver`, `PathRouter`)
 exposed by `policy/routing/router.py`. ADR-0002 defines routing
 distance, ordering, and bypass rules, but **the helper API surface
 itself** has had no ADR-level coverage.
 ## First action
 ### `AddressResolver(graph)`
 On construction, caches two pieces of state:
 1. `self._node_ids = set(graph.nodes)` — a set of all node ids for
   lookup.
 2. `self._hbm_slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube`
   — derived from `graph.spec.cube.memory_map` (default `48 GB / 8
   slices = 6 GB`). `resolve()` uses this value to decode `pe_id` from
   an HBM PA's `hbm_offset`.
 In short, **AddressResolver's first act is "precompute the full set of
 node ids and the HBM slice size"**. It does not retain the graph
 itself.
 ### `PathRouter(graph)`
 On construction, **builds four separate adjacency graphs in one pass**:
 1. `self._adj_all`: every edge (used for component-to-component
   routing).
 2. `self._adj`: edges with `kind != "command"` (PE DMA / generic data
   paths).
 3. `self._adj_mcpu_dma`: excludes
   `_MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}` (M_CPU DMA
   must not pass through PE pipeline nodes).
 4. `self._adj_local`: excludes the 8-element `_UCIE_KINDS` set (UCIe
   would look like a zero-distance bus to Dijkstra, which would prefer
   it over the mesh — for cube-local routing this must be avoided).
 Each graph is a `defaultdict(list)` of `(neighbor, weight)`. The
 weight is `edge.routing_weight_mm or edge.distance_mm`.
 In short, **PathRouter's first act is "classify topology edges into
 four policy-specific adjacency lists simultaneously"**. Each `find_*()`
 call picks the appropriate graph and runs Dijkstra.
 ## Context
 `policy/routing/router.py` performs two responsibilities together:
 - **Naming**: it is the sole owner of the topology naming convention
  (`sip{S}.cube{C}.<comp>`, `sip{S}.io{I}.pcie_ep`, etc.). Components /
  probe / IPCQ install / runtime API do not build node-id strings
  themselves — they call helpers.
 - **Path decisions**: policy separation by `edge.kind`. For the same
  src→dst, different routing intents (PE DMA vs M_CPU DMA vs general
  component routing) call for different adjacencies and so produce
  different paths.
 This helper API is widely consumed (probe.py / distributed.py /
 install.py / various components / tests), yet **the exact signatures /
 return semantics / adjacency picks** are not gathered in any ADR. This
 ADR closes that gap.
 ## Decision
 ### D1. `AddressResolver` exposes five public methods
 #### D1.1. `resolve(addr: PhysAddr) -> str`
 Translates a `PhysAddr` to a destination node id in the topology:
 ```
 addr.kind == "hbm"             → f"sip{s}.cube{d}.hbm_ctrl.pe{pe_id}"
  where pe_id = addr.hbm_offset // self._hbm_slice_bytes  (ADR-0017 D4/D9)
 addr.kind == "pe_resource":
  addr.unit_type == PE         → f"sip{s}.cube{d}.pe{addr.pe_id}.pe_tcm"
  addr.unit_type == SRAM       → f"sip{s}.cube{d}.sram"
  addr.unit_type == MCPU       → f"sip{s}.cube{d}.m_cpu"
  others                       → RoutingError("unsupported unit_type")
 other kinds                    → RoutingError("unsupported address kind")
 ```
 If the derived node id is not in `self._node_ids`, raises
 `RoutingError(f"node {node_id} not found in topology")`. So even a
 syntactically valid address fails loudly when its node is absent from
 the topology.
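The HBM branch of the translation can be sketched as below. This is a simplified illustration, not the repo's `resolve()`: `resolve_hbm` takes plain integers instead of a `PhysAddr`, the local `RoutingError` is a stand-in class, and the 48 GB / 8 slices defaults are the ones quoted in this ADR.

```python
class RoutingError(Exception):
    """Stand-in for the repo's RoutingError."""


def resolve_hbm(sip, cube, hbm_offset, node_ids, hbm_total_gb=48, slices_per_cube=8):
    """Map an HBM offset to its hbm_ctrl node id, failing if the node is absent."""
    slice_bytes = hbm_total_gb * (1 << 30) // slices_per_cube
    pe_id = hbm_offset // slice_bytes
    node_id = f"sip{sip}.cube{cube}.hbm_ctrl.pe{pe_id}"
    if node_id not in node_ids:
        raise RoutingError(f"node {node_id} not found in topology")
    return node_id


SLICE = 6 * (1 << 30)  # 48 GB / 8 slices
nodes = {f"sip0.cube0.hbm_ctrl.pe{i}" for i in range(8)}

print(resolve_hbm(0, 0, SLICE, nodes))
```

An offset at or beyond `8 * SLICE` decodes to `pe8`, which is not in the node set, so the lookup raises instead of returning a plausible-looking id.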
 #### D1.2. `find_m_cpu(sip, cube) -> str`
 Returns `f"sip{sip}.cube{cube}.m_cpu"`; absent → `RoutingError`.
 #### D1.3. `find_pcie_ep(sip, io_id="io0") -> str`
 Returns `f"sip{sip}.{io_id}.pcie_ep"`; absent → `RoutingError`.
 #### D1.4. `find_io_cpu(sip, io_id="io0") -> str`
 Returns `f"sip{sip}.{io_id}.io_cpu"`; absent → `RoutingError`.
 #### D1.5. `find_all_pcie_eps() -> list[str]`
 All PCIE_EP node ids across all SIPs, sorted. Filtered by
 `endswith(".pcie_ep")`. Cross-SIP IPCQ uses this when enumerating
 PCIE_EPs.
 This class is the sole owner of the naming convention
 (`sip{S}.cube{C}.<comp>`, `sip{S}.{io_id}.<comp>`) — ADR-0015 D4.
 The topology builder produces nodes with the same naming convention;
 components never build node-id strings directly — they go through
 these helpers.
 ### D2. `PathRouter`'s four adjacency graphs
 Constructed in one pass. `edge.kind` drives policy:
 | graph             | excluded edge kinds                                                                                                                   | use case                                       |
 |-------------------|--------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|
 | `_adj_all`        | (none)                                                                                                                               | M_CPU↔NOC command included, IO_CPU/M_CPU routes |
 | `_adj`            | `"command"`                                                                                                                          | PE DMA / generic data paths                     |
 | `_adj_mcpu_dma`   | `"pe_internal"`, `"pe_to_router"`                                                                                                    | M_CPU DMA (skips PE pipeline)                   |
 | `_adj_local`      | `_UCIE_KINDS` (`ucie_internal`, `ucie_conn_to_router`, `router_to_ucie_conn`, `ucie_conn_to_noc`, `noc_to_ucie_conn`, `ucie_mesh`, `io_to_cube`, `cube_to_io`) | same-cube routing (UCIe bus excluded) |
 Each graph is `dict[node_id, list[(neighbor, weight)]]` with weight =
 `edge.routing_weight_mm or edge.distance_mm`. Excluding command edges
 prevents them from influencing routing; isolating `_adj_local` keeps
 UCIe's "zero-distance bus" from out-competing the mesh — consistent
 with ADR-0017 D7's cross-PE-slice mesh-distance requirement.
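The one-pass classification in the table can be sketched as follows. This is an illustration under assumptions: the `Edge` tuple, `build_adjacency`, and the `"mesh"` edge kind are hypothetical, and edges are treated as directed. The exclusion sets mirror the table.

```python
from collections import defaultdict, namedtuple

Edge = namedtuple("Edge", "src dst kind distance_mm routing_weight_mm")

MCPU_DMA_EXCLUDE = {"pe_internal", "pe_to_router"}
UCIE_KINDS = {
    "ucie_internal", "ucie_conn_to_router", "router_to_ucie_conn",
    "ucie_conn_to_noc", "noc_to_ucie_conn", "ucie_mesh",
    "io_to_cube", "cube_to_io",
}


def build_adjacency(edges):
    """Classify every edge into the four policy graphs in a single pass."""
    adj_all, adj, adj_mcpu_dma, adj_local = (defaultdict(list) for _ in range(4))
    for e in edges:
        weight = e.routing_weight_mm or e.distance_mm
        item = (e.dst, weight)
        adj_all[e.src].append(item)
        if e.kind != "command":
            adj[e.src].append(item)
        if e.kind not in MCPU_DMA_EXCLUDE:
            adj_mcpu_dma[e.src].append(item)
        if e.kind not in UCIE_KINDS:
            adj_local[e.src].append(item)
    return adj_all, adj, adj_mcpu_dma, adj_local


edges = [
    Edge("a", "b", "mesh", 1.0, None),       # "mesh" is a made-up kind
    Edge("b", "c", "command", 1.0, None),
    Edge("c", "d", "pe_internal", 0.5, None),
    Edge("a", "d", "ucie_mesh", 0.0, 3.0),
]
adj_all, adj, adj_mcpu_dma, adj_local = build_adjacency(edges)
print(dict(adj_local))
```

Note that `routing_weight_mm or distance_mm` also falls back to `distance_mm` when `routing_weight_mm` is `0`, because `0` is falsy in Python.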
 ### D3. `PathRouter` exposes five public methods (+ two backward-compat shims)
 #### D3.1. `find_path(src_pe: str, dst_node: str) -> list[str]`
 **PE DMA routing**. `src_pe` is a PE prefix (e.g.,
 `"sip0.cube0.pe0"`); the function auto-prepends `.pe_dma`, making the
 true start node `"sip0.cube0.pe0.pe_dma"`.
 Adjacency depends on cube-locality (`_same_cube`):
 - **Same-cube** (src and dst share `sip{S}.cube{C}.` prefix): uses
  `_adj_local`. Excluding UCIe lets cross-PE-slice access pay accurate
  mesh distance (ADR-0017 D7).
 - **Cross-cube**: uses `_adj`. UCIe naturally becomes the right choice
  for the cross-cube portion.
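The cube-locality test and the resulting adjacency choice can be sketched as below. The helper names are hypothetical and this is not the repo's `_same_cube`; it only assumes the `sip{S}.cube{C}.` prefix convention described in this ADR.

```python
def cube_prefix(node_id):
    """Return the 'sip{S}.cube{C}.' prefix, or None for non-cube nodes."""
    parts = node_id.split(".")
    if len(parts) >= 2 and parts[0].startswith("sip") and parts[1].startswith("cube"):
        return f"{parts[0]}.{parts[1]}."
    return None


def same_cube(a, b):
    prefix = cube_prefix(a)
    return prefix is not None and prefix == cube_prefix(b)


def pick_adjacency_name(src_pe, dst_node):
    """Name the graph a PE DMA route would use for this src/dst pair."""
    start = src_pe + ".pe_dma"  # the true start node
    return "_adj_local" if same_cube(start, dst_node) else "_adj"


print(pick_adjacency_name("sip0.cube0.pe0", "sip0.cube0.hbm_ctrl.pe1"))
```

Comparing whole dot-separated components (with the trailing dot) keeps `cube1` from matching `cube10`, which a naive string-prefix check would get wrong.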
 #### D3.2. `find_path_with_distance(src_pe, dst_node) -> tuple[list[str], float]`
 Same adjacency policy as D3.1, but returns `(path, total_distance)`.
 Used by probe and analysis tools that need the distance metric.
 #### D3.3. `find_mcpu_dma_path(m_cpu_id: str, dst_hbm_id: str) -> list[str]`
 **M_CPU DMA path**. Same cube → `_adj_local` (stay within the mesh);
 different cube → `_adj_mcpu_dma` (cross via UCIe). The
 `_MCPU_DMA_EXCLUDE` set ensures PE-pipeline nodes never appear on
 M_CPU's cross-cube routes.
 #### D3.4. `find_memory_path(src: str, dst: str) -> list[str]`
 Direct memory path like
 `pcie_ep → io_noc → cube → router mesh → hbm_ctrl`. Uses
 `_adj_mcpu_dma` to exclude `pe_internal` and `pe_to_router`, so
 host-issued reads/writes never leak into the PE pipeline. Probe
 (ADR-0049 D1's H2D/D2H cases) calls this directly.
 #### D3.5. `find_node_path(src: str, dst: str) -> list[str]`
 Generic routing between arbitrary nodes, **including command edges**
 (via `_adj_all`). IoCpuComponent / MCpuComponent use this when they
 need to route through M_CPU ↔ NOC command-kind links.
 #### D3.6. Backward-compat shims
 - `_dijkstra(start, goal) -> list[str]` — thin wrapper for
  `_run_dijkstra(self._adj, …)`.
 - `_dijkstra_with_dist(start, goal) -> tuple[list[str], float]` —
  distance-aware variant.
 Despite the underscore prefixes (suggesting internal API), existing
 tests call these directly. New code should prefer D3.1–D3.5; these two
 shims are deprecation candidates.
 ### D4. Dijkstra — single-source shortest path
 `_run_dijkstra_with_dist(adj, start, goal)`:
 - `heapq` priority queue.
 - `best: dict[node, distance]` — best known distance to each node.
 - `prev: dict[node, predecessor]` — for path reconstruction.
 - Edge weight = `routing_weight_mm or distance_mm`. The separation
  matters because UCIe (and a few others) declare an explicit
  `routing_weight_mm` distinct from physical `distance_mm`.
 `start == goal` short-circuits to `([start], 0.0)`. Unreachable target
 → `RoutingError(f"no path from {start} to {goal}")`.
 The algorithm is **deterministic**: identical graph + start/goal gives
 the same path, satisfying SPEC R1 ("routing MUST be deterministic").
 Tie-breaks follow `heapq`'s push order (Python list order is
 deterministic).
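The search described above can be sketched with `heapq`, a `best` map, and a `prev` map. This is a minimal illustration rather than the repo's code: `RoutingError` is a local stand-in, and ties in this sketch are broken by the ordering of the `(distance, node_id)` heap tuples, which is one deterministic choice and may differ from the repo's.

```python
import heapq


class RoutingError(Exception):
    """Stand-in for the repo's RoutingError."""


def run_dijkstra_with_dist(adj, start, goal):
    """Return (path, total_distance) for the shortest start-to-goal route."""
    if start == goal:
        return [start], 0.0
    best = {start: 0.0}   # best known distance per node
    prev = {}             # predecessor per node, for path reconstruction
    heap = [(0.0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == goal:
            break
        if dist > best.get(node, float("inf")):
            continue      # stale heap entry
        for neighbor, weight in adj.get(node, ()):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if goal not in best:
        raise RoutingError(f"no path from {start} to {goal}")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path, best[goal]


adj = {"a": [("b", 1.0), ("c", 5.0)], "b": [("c", 1.0)]}
print(run_dijkstra_with_dist(adj, "a", "c"))
```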
 ### D5. Single-owner principle for helper-API decisions
 The following decisions live only inside router.py:
 - Naming convention: `sip{S}.cube{C}.<comp>`,
  `sip{S}.{io_id}.<comp>`,
  `sip{S}.cube{C}.hbm_ctrl.pe{pe_id}`.
 - Adjacency policy: which edge kinds belong to which graph.
 - Algorithm for recovering the PE id from an HBM offset and the HBM
  slice size.
 - Dijkstra weight selection
  (`routing_weight_mm or distance_mm`).
 Breaking single ownership (e.g., a component starting to build
 `f"sip{s}..."` itself) would explode the blast radius of naming-
 convention changes. This aligns with ADR-0015 D4.
 ### D6. Consumers of the helper API
 Methods listed in this ADR are called from (current corpus):
 - `probes/probe.py` (ADR-0049): `find_pcie_ep`, `find_io_cpu`,
  `find_m_cpu`, `find_node_path`, `find_mcpu_dma_path`,
  `find_memory_path`, `find_path`, `resolve`.
 - `runtime_api/distributed.py` (ADR-0047): indirectly (engine-internal
  routing).
 - `ccl/install.py` (ADR-0023): `find_all_pcie_eps`, `resolve`.
 - `sim_engine/event_log.py`: like probe — `find_pcie_ep`,
  `find_memory_path`.
 - `components/builtin/m_cpu.py`, `components/builtin/io_cpu.py`:
  `find_node_path`, `find_mcpu_dma_path`.
 - Tests (test_routing.py, test_cross_sip_routing.py, …): most of
  D3.1–D3.5.
 When a new consumer arrives, D1/D3 act as a first-pass guide on
 whether an existing method matches the intent or a new one is needed.
 ## Alternatives Considered
 ### A1. One adjacency graph + per-call edge-kind filtering
 Rejected. Re-filtering the graph on every `find_*()` call hurts
 Dijkstra cache locality. Constructing four graphs in one pass (D2)
 has modest memory cost (edges ≤ a few × 10⁴), and selection happens
 in O(1) at call time.
 ### A2. Drive adjacency separation by separate edge metadata rather than `kind`
 Rejected. `edge.kind` is already assigned by the topology builder
 (ADR-0015 D4 + ADR-0017); a parallel metadata field would force
 synchronization between two systems.
 ### A3. Use BFS with uniform weights instead of Dijkstra
 Rejected. With per-edge `routing_weight_mm` (mesh link / UCIe /
 IO-internal), BFS minimizes hop count rather than total
 latency/distance. SPEC R1 + R2 require deterministic and accurate
 routing, which BFS does not deliver.
 ### A4. Express the helper API as module functions instead of classes
 Rejected. Each class
 (`AddressResolver`, `PathRouter`) maintains caches
 (`_node_ids`, `_hbm_slice_bytes`, four adjacency graphs) reused across
 many routing queries on the same graph. Module functions would have
 to rebuild state per call or go global, hurting safety and
 performance.
 ## Consequences
 - When components / probe / IPCQ install / runtime API all go through
  router.py helpers, a naming-convention change (e.g., `.io0.` →
  `.iochiplet0.`) is a one-file edit (D5).
 - D2's four-graph split is now ADR-locked, so when a new edge kind is
  added (e.g., a new inter-die UCIe-link kind), the right adjacency
  category is decided explicitly rather than by default.
 - D3.1's same-cube vs cross-cube branching (ADR-0017 D7) is explicit,
  so anyone changing routing knows which adjacency to touch.
 - D6's consumer list bounds PR-review scope for helper-API changes,
  and the backward-compat shims (D3.6) are flagged as deprecation
  candidates.
@@ -0,0 +1,371 @@
 # ADR-0052: OpLog + MemoryStore Schemas — sim_engine internals
 ## Status
 Accepted (2026-05-22).
 Pins down the `OpRecord` schema and the `record_start` / `record_end` /
 `record_copy` behavior in `sim_engine/op_log.py`, plus the
 (space, addr) namespace and read/write semantics of `MemoryStore` in
 `sim_engine/memory_store.py`. ADR-0020 (2-pass data execution) declares
 that these two facilities exist, but **the precise record fields and
 semantics** had no ADR-level coverage, and several recent ADRs
 (ADR-0046 D3.2's `tl.store` visibility, ADR-0023 D9's IPCQ copy
 record) depend on these semantics.
 ## First action
 ### `OpLogger(memory_store=None)`
 On construction, initialize three fields:
 1. `self._records: list[OpRecord] = []` — accumulated records.
 2. `self._pending: dict[int, dict] = {}` — partial records keyed by
   `id(msg)` (created at `record_start`, completed at `record_end`).
 3. `self._memory_store = memory_store` — optional MemoryStore
   reference. Used to capture math-op input snapshots and dma_write
   HBM-source snapshots.
 Records and pending are empty; the `record_*` calls accumulate data
 over time.
 ### `MemoryStore()`
 On construction, initialize a single field:
 `self._storage: dict[str, dict[int, np.ndarray]] = {}` — a two-level
 dict (`space → addr → ndarray`). Inner dicts are created lazily as new
 spaces appear.
 In short, **both facilities' first act is "set up an empty accumulator
 buffer plus a sparse, per-space dict"**. The first record / write
 fills the fields when it arrives.
 ## Context
 ADR-0020 D2/D5/D7 (2-pass data execution) declares:
 - During Phase 1 (timing), `ComponentBase._on_process_start/end` hooks
  call `OpLogger.record_start/end`, recording the time and metadata of
  every data op.
 - Phase 2 (data) replays the op log in `t_start` order to compute real
  data.
 - Data payloads live in `MemoryStore`, keyed by (space, addr).
 Subsequent ADRs (ADR-0023 D9's IPCQ atomic write, ADR-0027's Megatron
 TP scratch-overwrite avoidance, ADR-0046 D3.2's `tl.store` visibility)
 depend on op_log and MemoryStore behavior, but **the exact record
 fields / space names / snapshot timing** are only discoverable via
 source grep. This ADR codifies them.
 ## Decision
 ### D1. `OpRecord` schema — seven fields
 ```python
 @dataclass
 class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str               # "memory" | "gemm" | "math" | "unknown"
    op_name: str               # e.g. "dma_read", "gemm_f16", "exp",
                               #     "TileToken/DMA_READ", "composite_gemm",
                               #     "ipcq_copy"
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)
 ```
 - **`t_start` / `t_end`**: SimPy time (float ns). `t_start` is when the
  component begins the op; `t_end` is completion. Duration =
  `t_end - t_start`.
 - **`component_id`**: the node id where the op occurred (e.g.,
  `"sip0.cube0.pe0.pe_dma"`).
 - **`op_kind`**: one of four. Phase 2 DataExecutor branches on this.
 - **`op_name`**: a debug/analysis-friendly name. For a TileToken,
  expands to `"TileToken/{stage_type}"` (e.g.,
  `"TileToken/DMA_READ"`) to disambiguate stages.
 - **`params`**: op-specific metadata dict (see D3).
 - **`dependency_ids`**: currently unused (default `[]`). Reserved for
  future cross-op dependency tracking.
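 As a small illustration of the field list above, the sketch below re-declares the dataclass and builds one record by hand. The concrete values and the `duration_ns` helper are made up for the example and are not taken from `sim_engine/op_log.py`.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str
    op_name: str
    params: dict[str, Any]
    dependency_ids: list[int] = field(default_factory=list)


def duration_ns(rec: OpRecord) -> float:
    """Hypothetical helper: duration as defined in D1 (t_end - t_start)."""
    return rec.t_end - rec.t_start


# Illustrative values only; the params keys follow D3.1 (dma_read).
rec = OpRecord(
    t_start=10.0,
    t_end=42.5,
    component_id="sip0.cube0.pe0.pe_dma",
    op_kind="memory",
    op_name="dma_read",
    params={"src_addr": 0x1000, "nbytes": 256, "handle_id": "h0"},
)
```

 Because `dependency_ids` uses `default_factory`, each record gets its own empty list rather than a shared one.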
 ### D2. `OpLogger.records` — guaranteed `t_start` sort
 ```python
 @property
 def records(self) -> list[OpRecord]:
    self._records.sort(key=lambda r: r.t_start)
    return self._records
 ```
 A stable sort by `t_start` runs on each access. Records with the same
 `t_start` preserve insertion order. Aligns with ADR-0020 D5's
 "t_start stable ordering" requirement.
 Phase 2 DataExecutor always accesses via the `records` property, so
 even when `record_end` calls arrive out of `t_start` order (e.g., a
 short op started later but finished earlier), the sequence handed to
 Phase 2 is consistent.
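 The ordering guarantee can be checked with a reduced logger. This is a sketch, assuming records are appended in `record_end` arrival order; `Rec` and `MiniLogger` are stand-ins invented for the example, not the real classes.

```python
from dataclasses import dataclass


@dataclass
class Rec:
    """Stand-in for OpRecord with only the fields needed here."""
    t_start: float
    op_name: str


class MiniLogger:
    def __init__(self) -> None:
        self._records: list[Rec] = []

    def append(self, rec: Rec) -> None:
        self._records.append(rec)

    @property
    def records(self) -> list[Rec]:
        # list.sort is stable: equal t_start keeps insertion order.
        self._records.sort(key=lambda r: r.t_start)
        return self._records


log = MiniLogger()
log.append(Rec(7.0, "short_op"))    # started later, finished first
log.append(Rec(5.0, "long_op"))     # started earlier, finished second
log.append(Rec(7.0, "tie_second"))  # same t_start as short_op

names = [r.op_name for r in log.records]
```

 Here `names` comes out as `long_op`, `short_op`, `tie_second`: the earlier start moves to the front, and the two records at `t_start == 7.0` keep their arrival order.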
 ### D3. `params` schema per `op_name` (matrix from `_extract_op_info`)
 #### D3.1. `op_kind="memory", op_name="dma_read"` (DmaReadCmd)
 ```python
 {"src_addr": int, "nbytes": int, "handle_id": str}
 ```
 #### D3.2. `op_kind="memory", op_name="dma_write"` (DmaWriteCmd)
 ```python
 {
    "src_space": str,   # handle.space ("tcm"|"hbm"|"sram"), default "tcm"
    "src_addr": int,    # handle.addr
    "shape": tuple, "dtype": str,
    "dst_space": "hbm", # DmaWrite always targets HBM
    "dst_addr": int,
    "nbytes": int,
    "handle_id": str,
    # When src_space == "hbm" at record_end, a snapshot is added (D4)
    "snapshot": np.ndarray | None,
 }
 ```
 #### D3.3. `op_kind="gemm", op_name=f"gemm_{dtype_a}"` (GemmCmd)
 ```python
 {
    "src_a_addr": int, "src_b_addr": int, "dst_addr": int,
    "shape_a": tuple, "shape_b": tuple, "shape_out": tuple,
    "dtype_in": str, "dtype_out": str,
    "m": int, "k": int, "n": int,
    # ADR-0027: per-operand + output spaces preserved
    "src_a_space": str, "src_b_space": str, "dst_space": str,
 }
 ```
 #### D3.4. `op_kind="math", op_name=msg.op` (MathCmd; op = "exp", "sum", "add", "where", …)
 ```python
 {
    "input_addrs": list[int],   # addrs of input handles
    "input_shapes": list[tuple],
    "input_spaces": list[str],
    "input_dtypes": list[str],
    "dst_addr": int, "dst_space": str,
    "shape_out": tuple, "dtype": str,
    "axis": int | None,         # only meaningful for reductions
    # All inputs get snapshots at record_end (D4)
    "input_snapshots": list[np.ndarray | None],
 }
 ```
 #### D3.5. `op_kind="gemm" or "math", op_name=f"composite_{op}"` (CompositeCmd)
 ```python
 {
    "op": str,              # "gemm" | "math"
    "out_addr": int, "out_nbytes": int,
    # If op == "gemm", same fields as GemmCmd are added:
    "src_a_addr": int, "src_b_addr": int,
    "shape_a": tuple, "shape_b": tuple,
    "dtype_in": str, "dtype_out": str,
    "src_a_space": str, "src_b_space": str,
    "dst_space": "hbm", "dst_addr": int,  # = out_addr
 }
 ```
 If `op == "gemm"`, `op_kind = "gemm"`; otherwise `"math"`. This aliasing
 lets Phase 2 replay composite-gemm on the same path as `GemmCmd`.
 #### D3.6. `op_kind="memory", op_name="ipcq_copy"` (record_copy path)
 ```python
 {
    "src_space": str, "src_addr": int,
    "dst_space": str, "dst_addr": int,
    "shape": tuple, "dtype": str, "nbytes": int,
    "snapshot": np.ndarray | None,   # passed by caller; if None, record_copy reads fresh
 }
 ```
 `PE_DMA._handle_ipcq_inbound` (ADR-0023 D9) emits this record so Phase
 2 can replay the IPCQ slot's inbound copy. It bypasses
 `record_start` / `record_end` and pushes directly via `record_copy()`.
 #### D3.7. `op_kind="unknown", op_name=type(msg).__name__`
 Fallback for messages `_extract_op_info` doesn't recognize. `params =
 {}`. If DataExecutor encounters this kind, it skips the record, so
 Phase 2 replay is unaffected.
 ### D4. Snapshot capture timing
 When `OpLogger._memory_store` is set, `record_end` performs:
 - **Math op**: read every input
  (addr/shape/space/dtype) from `self._memory_store.read(...)` and
  attach an ndarray copy to `params["input_snapshots"]`. Read failure
  → `None`.
 - **`dma_write` op**: snapshot the source **only if `src_space ==
  "hbm"`** and attach to `params["snapshot"]`. TCM (PE scratch)
  sources are **deliberately skipped** — TCM is repopulated by Phase 2
  math/gemm replay, and a Phase-1-time snapshot would capture a
  previous kernel's stale value (ADR-0027 postmortem: TP gemm →
  all_reduce race).
 - **`ipcq_copy`**: the caller passes the in-flight snapshot via
  `snapshot=token.data`. If absent, `record_copy` attempts a fresh
  read from MemoryStore.
 Snapshots are taken with `.copy()` (fresh allocation), making them
 safe against later storage mutation. This is the foundation of
 ADR-0027's "cross-PE Phase 2 ordering" race-avoidance.
 When `memory_store` is `None` (Phase 1 timing-only mode), all
 snapshot steps are skipped. Only the timing portion of the record is
 preserved; data replay is unavailable.
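 The `dma_write` rule can be sketched with a plain dict in place of `MemoryStore`. The helper name `capture_dma_write_snapshot` and the nested-dict layout are assumptions made for the example; only the "HBM sources only, taken with `.copy()`" behavior comes from the text above.

```python
import numpy as np


def capture_dma_write_snapshot(storage, params):
    """Hypothetical helper mirroring D4's dma_write rule.

    `storage` is a plain space -> addr -> ndarray dict standing in for
    MemoryStore; None models the timing-only mode.
    """
    if storage is None or params["src_space"] != "hbm":
        return None  # TCM sources are skipped on purpose
    return storage[params["src_space"]][params["src_addr"]].copy()


storage = {
    "hbm": {0x100: np.arange(4, dtype=np.float32)},
    "tcm": {0x0: np.ones(4, dtype=np.float32)},
}

hbm_snap = capture_dma_write_snapshot(storage, {"src_space": "hbm", "src_addr": 0x100})
tcm_snap = capture_dma_write_snapshot(storage, {"src_space": "tcm", "src_addr": 0x0})
no_store = capture_dma_write_snapshot(None, {"src_space": "hbm", "src_addr": 0x100})

# A later op overwrites the HBM buffer in place; the snapshot is unaffected.
storage["hbm"][0x100][:] = -1.0
```

 After the in-place overwrite, `hbm_snap` still holds the original values, which is the property the race-avoidance argument relies on.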
 ### D5. TileToken handling — `record_start` captures stage info
 ADR-0014 D6's self-routing tile token (pipeline mode) may have already
 advanced its `stage_idx` by the time `record_end` runs (the TileToken
 caches the next stage's params as it moves to the next component).
 Therefore:
 `record_start` pre-saves the following in `pending[id(msg)]["snap"]`:
 ```python
 snap["stage_type"] = stage.stage_type.name        # "DMA_READ", "GEMM", ...
 snap["stage_params"] = dict(stage.params)         # copy of params at start time
 ```
 `record_end` retrieves this snap and merges into params:
 - Adds `params["stage_type"]` to final params.
 - Merges `stage_params` keys (keeps existing values if any).
 - If `op_name == "TileToken"`, rewrites it to
  `f"TileToken/{stage_type}"` (e.g., `"TileToken/DMA_READ"`),
  disambiguating different stages emitted by the same component.
 Thanks to this, DMA_READ vs DMA_WRITE, FETCH vs STORE coming from the
 same component (e.g., pe_dma) are distinguishable in reports.
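 A minimal sketch of the start/end hand-off follows. The `Stage`, `TileToken`, and `StageType` classes are hypothetical reductions, and the function signatures are simplified; only the `snap` keys and the merge rules are taken from D5.

```python
from dataclasses import dataclass
from enum import Enum


class StageType(Enum):
    DMA_READ = 1
    GEMM = 2


@dataclass
class Stage:
    stage_type: StageType
    params: dict


@dataclass
class TileToken:
    """Hypothetical minimal token: a stage list plus a cursor."""
    stages: list
    stage_idx: int = 0


pending: dict[int, dict] = {}


def record_start(msg: TileToken, t: float) -> None:
    stage = msg.stages[msg.stage_idx]
    snap = {
        "stage_type": stage.stage_type.name,
        "stage_params": dict(stage.params),  # copy at start time
    }
    pending[id(msg)] = {"t_start": t, "snap": snap}


def record_end(msg: TileToken, op_name: str, params: dict):
    snap = pending.pop(id(msg))["snap"]
    merged = dict(params)
    merged["stage_type"] = snap["stage_type"]
    for key, value in snap["stage_params"].items():
        merged.setdefault(key, value)  # keep existing values if any
    if op_name == "TileToken":
        op_name = f"TileToken/{snap['stage_type']}"
    return op_name, merged


token = TileToken([Stage(StageType.DMA_READ, {"nbytes": 64, "src_addr": 16}),
                   Stage(StageType.GEMM, {"m": 8})])
record_start(token, 1.0)
token.stage_idx += 1  # the token has already advanced before record_end
name, params = record_end(token, "TileToken", {"nbytes": 128})
```

 Even though `stage_idx` points at the GEMM stage by the time `record_end` runs, the record is still labeled with the stage that was active at `record_start`.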
 ### D6. `MemoryStore` — two-level (space, addr) dict
 ```python
 class MemoryStore:
    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}
     def write(self, space, addr, data): self._storage.setdefault(space, {})[addr] = data  # creates the space lazily
    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray: ...
    def has(self, space, addr) -> bool: ...
    def snapshot(self) -> MemoryStore: ...
 ```
 #### D6.1. Space namespace
 A string key. Standard values:
 - `"hbm"`: HBM data (deploy_tensor + Phase 2 dma_write results).
 - `"tcm"`: PE-local TCM (Phase 2 math/gemm output).
 - `"sram"`: cube-level SRAM (ADR-0023 D9.7's IPCQ slot tier).
 Other spaces (e.g., `"reg"`) are allowed — `_storage` is a lazy dict
 that creates a new space when `write` first touches it.
 #### D6.2. Address keying
 `addr` is an integer. It may be a **physical address (PA) or a virtual
 address (VA)** — `MemoryStore` itself doesn't know address-space
 semantics; it just uses them as keys. Phase 1's `MemoryWriteMsg`
 writes both PA and VA
 (`_create_tensor` zero-inits at PA and at the VA base too); Phase 2
 reads/writes via the addresses captured by op_log.
 The caller decides `addr`'s meaning — `MemoryStore` provides only
 lookup.
 #### D6.3. read/write semantics — reference store (no copy)
 `write(space, addr, data)`: stores the ndarray reference. **No copy.**
 If the caller later mutates the same ndarray, the stored value
 changes.
 `read(space, addr, shape=None, dtype=None)`: returns the stored
 ndarray reference. If `shape`/`dtype` are provided:
 - `dtype != stored.dtype`: `arr.view(np_dtype)` reinterprets as a
  view (no copy).
 - `shape != stored.shape`: if `nbytes` matches, `arr.reshape(shape)`
  is a view.
 - `nbytes` mismatch → `ValueError`.
 To detach the data, the caller must call `arr.copy()`. ADR-0027's
 race-avoidance requires explicit `.copy()` in op_log snapshot steps
 for exactly this reason.
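 The read/write rules above can be exercised with a reduced store. This is a sketch under assumptions: `MiniStore` is not the real class, and converting `dtype` through `np.dtype(...)` is a guess at how the string dtype is mapped to NumPy.

```python
import numpy as np


class MiniStore:
    """Reduced stand-in for MemoryStore; not the real class."""

    def __init__(self) -> None:
        self._storage: dict[str, dict[int, np.ndarray]] = {}

    def write(self, space: str, addr: int, data: np.ndarray) -> None:
        # Stores the reference; the space dict is created on first touch.
        self._storage.setdefault(space, {})[addr] = data

    def read(self, space, addr, shape=None, dtype=None) -> np.ndarray:
        arr = self._storage[space][addr]
        if dtype is not None and np.dtype(dtype) != arr.dtype:
            arr = arr.view(np.dtype(dtype))      # reinterpret, no copy
        if shape is not None and tuple(shape) != arr.shape:
            wanted = int(np.prod(shape)) * arr.dtype.itemsize
            if wanted != arr.nbytes:
                raise ValueError(f"nbytes mismatch: {wanted} != {arr.nbytes}")
            arr = arr.reshape(shape)             # view for contiguous data
        return arr


store = MiniStore()
buf = np.zeros(8, dtype=np.float32)
store.write("tcm", 0x40, buf)
buf[0] = 3.0                                     # caller mutates after write

tile = store.read("tcm", 0x40, shape=(2, 4))
raw = store.read("tcm", 0x40, dtype="uint8")

try:
    store.read("tcm", 0x40, shape=(3, 3))        # 36 bytes wanted, 32 stored
    mismatch_raises = False
except ValueError:
    mismatch_raises = True
```

 The mutation of `buf` after `write` shows up in `tile`, which is why a caller that needs a stable value has to take its own `.copy()`.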
 #### D6.4. `has(space, addr) -> bool`
 Existence check; does not materialize data.
 #### D6.5. `snapshot() -> MemoryStore`
 Shallow copy. Creates new inner dicts but shares the ndarray
 references. Used at Phase 2 init to fork from Phase 1's store, so
 Phase 2 writes (which rebind keys in the forked dicts) don't affect
 Phase 1's remaining consumers. In-place mutation of a shared ndarray
 is still visible on both sides.
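 The fork behavior can be sketched on plain dicts. The `fork` function is a hypothetical equivalent written for this example; it only assumes what the text states, namely new inner dicts with shared array references.

```python
import numpy as np


def fork(storage):
    """Hypothetical equivalent of snapshot(): new inner dicts, shared arrays."""
    return {space: dict(inner) for space, inner in storage.items()}


phase1 = {"hbm": {0x0: np.zeros(2)}}
phase2 = fork(phase1)

# Right after the fork both sides point at the same ndarray object.
shared_before = phase2["hbm"][0x0] is phase1["hbm"][0x0]

# Phase 2 writes rebind keys in its own inner dict, so Phase 1 is untouched.
phase2["hbm"][0x0] = np.ones(2)
phase2["hbm"][0x8] = np.full(2, 7.0)
```

 Isolation here comes from rebinding keys in the forked inner dict; the arrays themselves are never duplicated.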
 ### D7. op_log assumes a single-threaded SimPy
 `OpLogger`'s `_records` and `_pending` are lock-free. SimPy is
 single-threaded, so nothing else can intrude between `record_start`
 and `record_end` for the same message.
 When multi-process kernbench (ADR-0047 D6) arrives, OpLogger must be
 split per process — one OpLogger instance cannot receive records from
 multiple processes.
 ## Alternatives Considered
 ### A1. Externalize op_log to SQLite / parquet
 Rejected (currently). The in-memory list minimizes Phase 1 → Phase 2
 hand-off latency. Externalization makes sense for long-running batch
 runs but adds overhead for the current single-run workload.
 ### A2. Capture snapshots at `record_start`
 Rejected. At `record_start`, inputs are often not yet populated (e.g.,
 a math op's input is the output of a just-issued previous op).
 `record_end` is the correct point.
 ### A3. Per-component MemoryStore
 Rejected. The (space, addr) key already disambiguates effectively, and
 splitting per component would complicate cross-PE IPCQ copy (ADR-0023
 D9), which needs access to both source and destination stores.
 ### A4. Explicit dependency edges in op_log
 Partially adopted. The `dependency_ids` field exists on `OpRecord` but
 is currently unused (D1). Phase 2 DataExecutor orders via `t_start` +
 a secondary sort (memory ops before math at the same `t_start`). When
 an explicit dependency graph is required, this field is the home.
 Current ordering rules are sufficient, so it remains unused.
 ## Consequences
 - ADR-0020's op_log / MemoryStore declarations are expanded into the
  concrete D1–D6 schemas, so writing/modifying Phase 2 DataExecutor
  doesn't need source-grep to learn field semantics.
 - D3's per-`op_name` params matrix makes adding new ops (e.g., a new
  reduction type) a question of branching in `_extract_op_info`.
 - D4's per-op snapshot policy (math = input snapshot, dma_write =
  HBM-only snapshot) is ADR-locked, so ADR-0027's race-avoidance
  decision won't silently regress on future refactors.
 - D6.3's reference-store semantics are explicit, putting mutation
  safety on the caller. ADR-0027's explicit `.copy()` pattern is
  justified.
 - D7's single-thread assumption is recorded, so multi-process
  kernbench (ADR-0047 D6's supersession candidate) will need OpLogger
  separation when introduced.
@@ -0,0 +1,351 @@
 # ADR-0053: Topology Builder + Visualizer Algorithms
 ## Status
 Accepted (2026-05-22).
 Pins down the key algorithmic choices of the topology compile and
 visualization pipeline jointly implemented by `topology/builder.py`,
 `topology/mesh_gen.py`, and `topology/visualizer.py` —
 placement-driven router attachment, mesh auto-layout, the source_hash
 cache, view projections, and SVG rendering. ADR-0006 defines the
 high-level intent of topology compilation (compiled topology, distance
 extraction, automatic diagram generation), but **which algorithms the
 builder actually uses** was only discoverable via source grep.
 ## First action
 When `resolve_topology(path_str)` is called, four steps run in order:
 1. **Path validation** (`builder.py::resolve_topology`):
   `Path(path_str).expanduser().resolve()`, existence check, file
   check. Failure → `FileNotFoundError` or `ValueError`.
 2. **YAML parsing** (`_read_spec`): `yaml.safe_load`. Parse errors
   yield a `ValueError` with line/column. Non-dict roots are
   rejected.
 3. **Auto-generate the mesh** (`mesh_gen.ensure_mesh_file`): create or
   reuse a `cube_mesh.yaml` next to the topology file. Cache hit on
   matching source_hash; miss triggers regeneration. This step decides
   the cube NoC's router grid and attachment information.
 4. **Compile the graph** (`_compile_graph`): system → IO chiplets →
   cubes → inter-cube edges → IO↔cube edges → system↔IO edges, then
   build four view projections (system, sip, cube, pe) and wrap into
   a `TopologyGraph`.
 In short, **topology compilation's first act is "read topology.yaml as
 a dict, create/validate cube_mesh.yaml in the same directory, then
 build the flat graph + 4-view projection in system → sip → cube → pe
 order"**.
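 Step 1 can be sketched with `pathlib` alone. The function name `validate_topology_path` and the exact mapping of failures to exception types are assumptions; the text only says that failure yields `FileNotFoundError` or `ValueError`.

```python
import tempfile
from pathlib import Path


def validate_topology_path(path_str: str) -> Path:
    path = Path(path_str).expanduser().resolve()
    if not path.exists():
        raise FileNotFoundError(f"topology file not found: {path}")
    if not path.is_file():
        # Assumed mapping: "exists but is not a file" -> ValueError.
        raise ValueError(f"topology path is not a file: {path}")
    return path


def outcome(path_str: str) -> str:
    """Report 'ok' or the name of the exception that was raised."""
    try:
        validate_topology_path(path_str)
        return "ok"
    except (FileNotFoundError, ValueError) as exc:
        return type(exc).__name__


with tempfile.TemporaryDirectory() as tmp:
    topo = Path(tmp) / "topology.yaml"
    topo.write_text("system: {}\n")
    outcomes = [
        outcome(str(topo)),                       # a real file
        outcome(tmp),                             # a directory
        outcome(str(Path(tmp) / "missing.yaml")), # does not exist
    ]
```

 YAML parsing and mesh generation (steps 2 and 3) are left out because they depend on PyYAML and on `mesh_gen` internals.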
 ## Context
 `topology/` package responsibilities:
 - **builder.py** (1207 lines): turns topology.yaml into a
  `TopologyGraph` (nodes + edges + 4 view projections).
 - **mesh_gen.py** (305 lines): auto-decides the cube NoC's router
  grid and PE/UCIe/M_CPU/SRAM attachment positions and caches them in
  `cube_mesh.yaml`.
 - **visualizer.py** (887 lines): generates four SVG diagrams (system /
  sip / cube / pe) from a `TopologyGraph`.
 ADR-0006 makes the high-level decision that "the result of topology
 compilation is the single source for distance metadata and diagram
 generation", but specific algorithms (e.g., placement-driven nearest-
 router attachment, the HBM exclusion zone, which fields in source_hash
 trigger regeneration) are not in any ADR.
 In particular, these decisions are absent at ADR level:
 - Why is mesh_gen cached in a separate file (`cube_mesh.yaml`)?
 - Which fields are in source_hash, and which changes force
  regeneration?
 - Why placement coordinates in mm rather than cube coordinates?
 - How are the HBM exclusion zone and UCIe N/S/E/W distribution
  decided inside the mesh?
 - What is the abstraction-level difference among the four view
  projections (system/sip/cube/pe)?
 This ADR captures these decisions in one place.
 ## Decision
 ### D1. Compile pipeline — six stages
 `_compile_graph(spec)`:
 1. **System nodes** (`_instantiate_system`): add system-level nodes
   like `fabric.switch0` and the host CPU.
 2. **Per-SIP loop** (`for sip_id in range(system.sips.count)`):
   - **IO chiplets** (`_instantiate_io_chiplets`): create pcie_ep /
     io_cpu / io_noc / io_ucie PHYs / conn nodes and their bidirectional
     internal edges.
   - **Cube instantiation** (`_instantiate_cube`): using
     cube_mesh.yaml's router grid, instantiate cube routers, PE
     sub-components (pe_cpu, pe_dma, pe_fetch_store, pe_gemm, pe_math,
     pe_mmu, pe_tcm, pe_scheduler, pe_ipcq), m_cpu, sram, hbm_ctrl,
     and their internal edges.
   - **Inter-cube edges** (`_add_inter_cube_edges`): the UCIe
     N/S/E/W mesh edges.
   - **IO ↔ cube edges** (`_add_io_to_cube_edges`): connect io_noc to
     each cube's edge UCIe phy.
 3. **Switch ↔ IO edges** (`_add_system_to_io_edges`): bidirectional
   edges between `fabric.switch0` and each SIP's `pcie_ep` (the
   cross-SIP IPCQ path of ADR-0038 D3 + ADR-0010).
 4. **Build four view projections**:
   - `_build_system_view(spec)` — Tray level: SIPs and the system
     switch.
   - `_build_sip_view(spec)` — inside one SIP: cube mesh + IO
     chiplet.
   - `_build_cube_view(spec)` — inside one cube: router grid + PE /
     M_CPU / SRAM / HBM_CTRL attachments.
   - `_build_pe_view(spec)` — inside one PE: nine sub-components +
     internal edges (pe_internal kind).
 5. **Return `TopologyGraph`**: `TopologyGraph(spec, nodes, edges,
   system_view, sip_view, cube_view, pe_view)`.
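 The six graph-building stages (system nodes, IO chiplets, cubes,
 inter-cube edges, IO ↔ cube edges, switch ↔ IO edges) are **ordered
 for a reason**: only after cubes exist do inter-cube edges have valid
 src/dst, and IO chiplets must precede the IO ↔ cube edges that
 reference them. View projection and the `TopologyGraph` return follow
 as steps 4 and 5. New node types must slot in the right spot.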
 ### D2. `cube_mesh.yaml` — a separate file with a source_hash cache
 `mesh_gen.ensure_mesh_file(cube_spec, mesh_path)`:
 1. Compute `source_hash = _compute_source_hash(cube_spec)` from these
   input fields:
   - `geometry` (cube_mm.w/h …).
   - `pe_layout` (corners, pe_per_corner).
   - `ucie.n_connections`.
   - `memory_map.hbm_mapping_mode`.
   - `placement` (m_cpu/sram pos_mm).
 2. If `mesh_path` (= `cube_mesh.yaml` next to topology.yaml) exists
   and `existing.source_hash == source_hash`, reuse it (cache hit).
 3. Otherwise, generate a new mesh via
   `_generate_mesh(cube_spec, source_hash)` and write to yaml.
 The mesh is cached as a separate file because:
 - Mesh generation involves nontrivial PE/UCIe/router attachment math
  and is too expensive to redo every time.
 - Multiple runs with the same cube spec must guarantee an identical
  mesh.
 - The resulting mesh is itself an inspectable / debuggable artifact.
 The five fields listed in source_hash are the ones that determine
 mesh shape; other changes (e.g., bandwidth, overhead_ns) do not
 trigger mesh regeneration.
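 The invalidation rule can be illustrated with a toy hash. This is a sketch, assuming a SHA-256 over a canonical JSON dump of the five fields; the real `_compute_source_hash` may serialize differently, and the spec values (including `placeholder_mode`) are invented.

```python
import copy
import hashlib
import json

# The five inputs named in D2, as key paths into the cube spec.
HASHED_FIELDS = (
    ("geometry",),
    ("pe_layout",),
    ("ucie", "n_connections"),
    ("memory_map", "hbm_mapping_mode"),
    ("placement",),
)


def _dig(spec, path):
    cur = spec
    for key in path:
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur


def compute_source_hash(cube_spec: dict) -> str:
    picked = {".".join(path): _dig(cube_spec, path) for path in HASHED_FIELDS}
    blob = json.dumps(picked, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


base = {
    "geometry": {"cube_mm": {"w": 12.0, "h": 12.0}},
    "pe_layout": {"corners": ["NW", "NE", "SW", "SE"], "pe_per_corner": 2},
    "ucie": {"n_connections": 4},
    "memory_map": {"hbm_mapping_mode": "placeholder_mode", "capacity_gb": 16},
    "placement": {"m_cpu": {"pos_mm": [1.5, 5.5]}, "sram": {"pos_mm": [1.5, 8.5]}},
    "links": {"noc": {"bw_gbs": 100}},
}

bw_only = copy.deepcopy(base)
bw_only["links"]["noc"]["bw_gbs"] = 200      # not a hashed field

more_ucie = copy.deepcopy(base)
more_ucie["ucie"]["n_connections"] = 8       # a hashed field

cache_hit = compute_source_hash(bw_only) == compute_source_hash(base)
cache_miss = compute_source_hash(more_ucie) != compute_source_hash(base)
```

 Hashing only the selected fields is what lets a bandwidth-only edit reuse the existing `cube_mesh.yaml`.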
 ### D3. Cube NoC mesh auto-layout
 `_generate_mesh(cube_spec, source_hash)`:
 #### D3.1. Rows / columns
 - `pe_positions = _corner_pe_positions(cube_w, cube_h)`: PE-center
  coordinates (mm) per corner (NW/NE/SW/SE). Hardcoded patterns like
  `(1.5, 1.5)` and `(cube_w-1.5, cube_h-1.5)`; with `pe_per_corner=2`,
  each corner has two PE positions.
 - `col_xs = _compute_col_positions(...)`: union of PE x-coordinates,
  plus relay columns inserted when any gap exceeds
  `max_spacing = 3.0 mm`.
 - `row_ys, rows_per_half = _compute_row_positions(cube_h,
  n_connections, pe_positions)`:
  - `n_conn = max(n_connections, 2)` (hot-path minimum).
  - `rows_per_half = ceil(n_conn / 2)`.
  - Top half + two HBM rows + bottom half. HBM sits at
    `(cube_h/2 - 1.5, cube_h/2 + 1.5)`. The gap between PE rows and
    HBM rows is `hbm_gap = 1.5 mm`.
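 The column and row arithmetic can be sketched as follows. The relay placement rule (evenly spaced relays inside an oversized gap) is an assumption, since the text only says that relay columns are inserted when a gap exceeds `max_spacing`; the function names are simplified versions of the ones mentioned above.

```python
import math

MAX_SPACING_MM = 3.0


def compute_col_positions(pe_xs, max_spacing=MAX_SPACING_MM):
    """Union of PE x-coordinates plus relay columns for large gaps.

    Assumed rule: split an oversized gap into equal steps, each no
    longer than max_spacing. Expects at least one PE coordinate.
    """
    cols = sorted(set(pe_xs))
    out = [cols[0]]
    for x in cols[1:]:
        start = out[-1]
        gap = x - start
        if gap > max_spacing:
            n_relays = math.ceil(gap / max_spacing) - 1
            step = gap / (n_relays + 1)
            out.extend(start + step * i for i in range(1, n_relays + 1))
        out.append(x)
    return out


def rows_per_half(n_connections: int) -> int:
    # n_conn is clamped to at least 2 before halving, as in D3.1.
    return math.ceil(max(n_connections, 2) / 2)


wide = compute_col_positions([1.5, 10.5])
tight = compute_col_positions([1.5, 3.0, 1.5])
halves = [rows_per_half(n) for n in (1, 2, 3, 4)]
```

 With PEs at 1.5 mm and 10.5 mm, the 9 mm gap is split by two relay columns, so no adjacent pair of columns is more than 3 mm apart.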
 #### D3.2. HBM exclusion zone
 `hbm_row_start = rows_per_half`,
 `hbm_row_end = rows_per_half + 1`.
 `hbm_col_start = n_cols // 2 - 1`,
 `hbm_col_end = n_cols // 2`.
 Router slots inside this (row, col) rectangle are marked `None` (no
 router). HBM controllers are added separately as
 `hbm_ctrl.pe{X}` nodes following ADR-0017 D9's per-PE partition
 pattern.
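 A small sketch of the exclusion rectangle, assuming the start/end indices are inclusive bounds. `build_router_grid` is a hypothetical helper, and the `r{R}c{C}` labels borrow the router naming from D4 purely for readability.

```python
def build_router_grid(n_rows: int, n_cols: int, rows_per_half: int):
    """Grid of router labels with None inside the HBM exclusion zone."""
    hbm_row_start = rows_per_half
    hbm_row_end = rows_per_half + 1
    hbm_col_start = n_cols // 2 - 1
    hbm_col_end = n_cols // 2

    grid = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            # Assumed: both bounds are inclusive.
            in_hbm = (hbm_row_start <= r <= hbm_row_end
                      and hbm_col_start <= c <= hbm_col_end)
            row.append(None if in_hbm else f"r{r}c{c}")
        grid.append(row)
    return grid


# rows_per_half = 2 gives 2 top rows + 2 HBM rows + 2 bottom rows.
grid = build_router_grid(n_rows=6, n_cols=4, rows_per_half=2)
empty_slots = [(r, c) for r, row in enumerate(grid)
               for c, slot in enumerate(row) if slot is None]
```

 For a 6 x 4 grid this leaves a 2 x 2 hole in the middle, at rows 2 and 3 and columns 1 and 2.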
 #### D3.3. PE attachment
 Each corner's PEs map to a row:
 - Top half: NW → row 0, NE → row 1 (top_corners index).
 - Bottom half: SW → row `hbm_row_end + 1`, SE → row
  `hbm_row_end + 2`.
 Each PE's x-coordinate attaches to the nearest column's router
 (`min(range(n_cols), key=lambda c: abs(col_xs[c] - pe_x))`).
 Attachment items are `pe{pe_idx}.dma`, `pe{pe_idx}.cpu`,
 `pe{pe_idx}.hbm` (pushed into the router's attach list).
 #### D3.4. M_CPU / SRAM attachment — nearest router by Euclidean distance
 For `placement.m_cpu.pos_mm` (default `[1.5, 5.5]`) and
 `placement.sram.pos_mm` (default `[1.5, 8.5]`), find the router with
 the smallest Euclidean distance and append `"m_cpu"` / `"sram"` to
 its attach list.
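 The nearest-router lookup can be sketched with `math.hypot`. Skipping empty slots and breaking ties in row-major order are assumptions, and the 4 x 4 grid below is a toy layout rather than one produced by `_generate_mesh`.

```python
import math


def nearest_router(pos_mm, col_xs, row_ys, grid):
    """Return (row, col) of the closest non-empty router slot."""
    px, py = pos_mm
    best = None
    for r, y in enumerate(row_ys):
        for c, x in enumerate(col_xs):
            if grid[r][c] is None:      # assumed: empty (HBM) slots are skipped
                continue
            dist = math.hypot(x - px, y - py)
            if best is None or dist < best[0]:
                best = (dist, r, c)
    if best is None:
        raise ValueError("grid has no routers")
    return best[1], best[2]


col_xs = [1.5, 4.5, 7.5, 10.5]
row_ys = [1.5, 4.5, 7.5, 10.5]
grid = [[f"r{r}c{c}" for c in range(4)] for r in range(4)]
for r in (1, 2):
    for c in (1, 2):
        grid[r][c] = None               # toy exclusion zone

# Default positions from D3.4.
attach = {}
for name, pos in (("m_cpu", (1.5, 5.5)), ("sram", (1.5, 8.5))):
    attach.setdefault(nearest_router(pos, col_xs, row_ys, grid), []).append(name)

skips_hole = nearest_router((4.5, 4.0), col_xs, row_ys, grid)
```

 On this toy grid the default `m_cpu` and `sram` positions land on the leftmost column, one row apart.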
 #### D3.5. UCIe N/S/E/W distribution
 `ucie_pe_rows = top_pe_rows + bot_pe_rows` (total
 `2 * rows_per_half`).
 - UCIe-E: one PE row at a time, attach `ucie_e.c{i}` to the rightmost
  column's router.
 - UCIe-W: attach `ucie_w.c{i}` to the leftmost column's router (E's
  mirror).
 - UCIe-N/S: split PE columns into left and right halves; attach to
  the top row's / bottom row's matching columns.
 Each UCIe connection is suffixed `c{i}`, distributing the
 `ucie.n_connections` PHYs (ADR-0017 D5+).
 ### D4. Node naming convention — single ownership
 builder.py creates nodes with the following naming convention (the
 single-owner principle from ADR-0051 D5):
 - `fabric.switch0` — system-level switch.
 - `sip{S}.{io_id}.{pcie_ep|io_cpu|io_noc|io_ucie.{dir}|conn.{id}}` —
  IO chiplet.
 - `sip{S}.cube{C}.{m_cpu|sram|hbm_ctrl.pe{X}|noc.r{R}c{C}|...}` —
  inside cube.
 - `sip{S}.cube{C}.pe{P}.{pe_cpu|pe_dma|pe_fetch_store|pe_gemm|pe_math|pe_mmu|pe_tcm|pe_scheduler|pe_ipcq}` —
  PE sub-components.
 Changing this convention requires updating both builder.py and
 router.py's helpers (ADR-0051). Components never know the convention
 directly — they only call the helpers.
 ### D5. Edge `kind` classification
 Every edge gets a `kind`; routing policy (ADR-0051 D2) reads it. Major
 kinds:
 - `"pe_internal"` — within a PE between sub-components.
 - `"pe_to_router"` — PE_DMA ↔ cube NoC router.
 - `"router_mesh"` — between cube NoC routers.
 - `"router_to_hbm"`, `"router_to_mcpu"`, `"router_to_sram"`,
  `"sram_to_router"`, etc. — between cube-attached components.
 - `"ucie_internal"`, `"ucie_conn_to_router"`,
  `"router_to_ucie_conn"`, `"ucie_conn_to_noc"`,
  `"noc_to_ucie_conn"`, `"ucie_mesh"` — UCIe-related.
 - `"io_internal"` — inside IO chiplet.
 - `"io_to_cube"`, `"cube_to_io"` — at the IO ↔ cube boundary.
 - `"pcie"` — switch ↔ pcie_ep.
 - `"command"` — control-plane edges only (e.g., M_CPU ↔ NOC; excluded
  from PE DMA paths).
 Adding a new edge kind requires picking a category in router.py's
 four adjacency graphs (ADR-0051 D2). If you forget, it defaults to
 `_adj_all` only, which can produce unintended routes.
 ### D6. View projection — four abstraction levels
 `TopologyGraph` keeps four view projections alongside the flat
 nodes+edges:
 - **system_view** (`_build_system_view`): Tray level. SIP blocks and
  `fabric.switch0`. PCIe links shown. For external high-level
  overview.
 - **sip_view** (`_build_sip_view`): inside one SIP — cube mesh + IO
  chiplet (pcie_ep + io_cpu + io_noc). UCIe N/S/E/W appear as
  cube-cube links.
 - **cube_view** (`_build_cube_view`): inside one cube — router grid +
  PE / M_CPU / SRAM / HBM_CTRL attachments + UCIe PHY edges. For
  intra-cube routing / placement debugging.
 - **pe_view** (`_build_pe_view`): inside one PE — nine sub-components
  + internal edges (pe_internal kind). For detailed PE-internal
  dataflow review.
 Views are selectively rendered via the spec's
 `visualization.emit_views: [system, sip, cube]` (ADR-0006). The pe
 view is omitted from default output but the code is retained for
 detailed debugging.
 ### D7. visualizer.py — SVG diagram output
 `emit_diagrams(graph, out_dir)` renders every view as SVG. Key
 functions:
 - `_render_view_svg(view)` — generic view render (no router grid).
 - `_render_cube_view_svg(view, spec)` — cube-view specific (HBM block,
  router grid layout, PE/M_CPU/SRAM/HBM placement).
 - `_draw_node`, `_draw_edge` — node/edge visual representation.
 - `_pick_scale`, `_compute_node_sizes` — auto-scaling.
 The visualizer is a **derived artifact** (ADR-0006); changes here are
 not subject to production checks. Aligns with CLAUDE.md's "Derived
 Artifacts" guidance.
 ### D8. Blast radius of spec changes
 | spec field                            | effect              | mesh regenerated? |
 |---------------------------------------|---------------------|-------------------|
 | `system.sips.count`                   | SIP count, node count | No                |
 | `sip.cube_mesh.w/h`                   | cube mesh shape     | No                |
 | `cube.geometry.cube_mm.w/h`           | cube size (mm)      | **Yes**           |
 | `cube.pe_layout.corners/pe_per_corner`| PE attachment positions | **Yes**       |
 | `cube.ucie.n_connections`             | UCIe PHY distribution | **Yes**         |
 | `cube.memory_map.hbm_mapping_mode`    | HBM distribution mode | **Yes**         |
 | `cube.placement`                      | M_CPU/SRAM positions | **Yes**          |
 | `cube.memory_map.*` (besides above)   | HBM capacity / BW   | No                |
 | `*.links.*.bw_gbs`                    | edge bandwidth      | No                |
 | `*.attrs.overhead_ns`                 | component latency   | No                |
 The table mirrors D2's `_compute_source_hash` inputs. Changes that
 require mesh regeneration automatically invalidate `cube_mesh.yaml`'s
 source_hash.
 ## Alternatives Considered
 ### A1. Regenerate the mesh on every compile without a cache file
 Rejected. The cost of mesh generation would be paid repeatedly (CLI
 runs, probe, tests) for the same spec, and the human-inspectable
 artifact would disappear.
 ### A2. Merge mesh generation into builder.py
 Rejected (currently). It is a 305-line algorithm of its own, and the
 mesh-layout decisions (placement-driven router attachment, HBM
 exclusion zone) are different from builder's general node/edge
 emission. Keeping it separate respects single-responsibility.
 ### A3. Express placement coordinates in cube coordinates (col/row)
 Rejected. mm coordinates flow consistently between the visualizer and
 mesh layout (for nearest-router computation). Cube coordinates are
 undefined until the router grid is fixed, so they are unsuitable as
 placement input.
 ### A4. Lazy view projection generation
 Rejected (currently). The four views are cheap to build (typically <
 100 ms), and eager construction guarantees `TopologyGraph` as the
 single source of truth.
 ### A5. Visualizer output in formats besides SVG (PNG/PDF)
 Rejected. SVG is vector + text-searchable + directly renderable in
 browsers. PNG conversion, when required, is downstream
 post-processing (e.g., rsvg-convert).
 ## Consequences
 - ADR-0006's high-level intent is fleshed out via D1–D7; topology
  changes can be assessed quickly via D8's table.
 - D3's mesh-layout algorithm is ADR-locked, so when a future PE
   attachment pattern (e.g., a 6-zone HBM split) is proposed, it is
   clear which stage it affects.
 - D5's edge-kind list and D6's view structure are explicit, giving PR
  reviewers a quick map of where (builder + router + visualizer) a
  new component type ripples through.
 - D2's source_hash invalidation rules are explicit, so a
   `cube_mesh.yaml` that is reused unchanged (e.g., when only bandwidth
   changed) is recognized as correct behavior, not a stale cache.
@@ -0,0 +1,143 @@
 # ADR-0054: Milestone Eval Benches — self-contained sweep + figure benches
 ## Status
 Accepted (2026-05-22).
 Amends ADR-0044 (D1/D2) and ADR-0045 (D5) and supersedes the "logic lives
 in `scripts/` + `tests/`" arrangement of ADR-0043/0044: the GEMM and
 allreduce evaluation harnesses are now self-contained **benches** that a
 user runs to regenerate every result + figure.
 ## Context
 ADR-0043 (allreduce eval) and ADR-0044 (GEMM eval) split each harness into
 a **sweep** (a manual `scripts/` driver, or — for allreduce — the
 parametrized tests themselves) plus **figure tests** that render committed
 data. The sweep/render logic therefore lived under `scripts/gemm_sweep.py`,
 `tests/gemm/_gemm_plot_helpers.py`, and `tests/sccl/_allreduce_helpers.py`.
 A milestone requirement ("refactor allreduce + GEMM evaluation so a user
 can run *one bench* to generate all the results and plots") cannot be met
 by that layout: a bench is production code and **must not import from
 `tests/`** (ADR-0007 layer direction). The eval logic had to move into
 production, reachable from a bench.
 The chosen home is the bench module itself — not a separate
 `kernbench.eval` package. A bench file may contain arbitrary module-level
 code; collapsing the harness into the bench keeps one file per domain and
 avoids an extra package layer.
 ## Decision
 ### D1. Two milestone benches own the eval logic
 - `src/kernbench/benches/milestone_1h_gemm.py` — GEMM shape×variant sweep +
  the three figure renderers (moved from `scripts/gemm_sweep.py` +
  `tests/gemm/_gemm_plot_helpers.py`).
 - `src/kernbench/benches/milestone_1h_ccl.py` — the distributed allreduce
  driver, latency + buffer-kind sweeps, topology diagram, FSIM comparison,
  and the direct-launch parity reference (moved from
  `tests/sccl/_allreduce_helpers.py`).
 Each file is the **single home** for its domain's eval logic.
 ### D2. The "eval bench" pattern (extends ADR-0045 D5)
 ADR-0045 D5 fixed a bench to a single configuration (single-SIP, or the
 ADR-0024 multi-SIP CCL exception). This ADR adds a third pattern:
 - An **eval bench** may drive *many* configurations and render figures. It
  builds its own per-config `GraphEngine` / `RuntimeContext` instances
  (one per sweep point) rather than using the outer `run_bench` engine.
 - Because the outer ctx then has no submitted handles, the bench submits a
  **sentinel tensor** (`torch.zeros((1, 1), …)`) at the end to satisfy
  `run_bench`'s "must submit at least one request" contract (ADR-0045 D4),
  so the CLI exits 0.
 ### D3. Output location
 Both benches write to `src/kernbench/benches/1H_milestone_output/{gemm,ccl}/`
 (per user request — artifacts beside the bench). The directory holds only
 generated PNG/CSV/JSON (never a `.py`/`__init__.py`), so the eager-import
 audit (ADR-0045 first action) ignores it — `pkgutil.iter_modules` does not
 yield non-package subdirectories. It is **committed** (like the
 `docs/diagrams/` artifacts) so the figures are viewable on the remote;
 rerunning the bench regenerates it in place.
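 The claim about `pkgutil.iter_modules` can be checked in a scratch directory. The file names below imitate the layout described in D1 and D3; nothing here touches the real `kernbench.benches` package.

```python
import pkgutil
import tempfile
from pathlib import Path


def discover(pkg_dir: Path) -> list[str]:
    """Module names that iter_modules reports for one directory."""
    return sorted(info.name for info in pkgutil.iter_modules([str(pkg_dir)]))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "milestone_1h_gemm.py").write_text("")   # a bench module
    out_dir = root / "1H_milestone_output" / "gemm"  # artifacts only
    out_dir.mkdir(parents=True)
    (out_dir / "figure.png").write_bytes(b"")
    found = discover(root)
```

 Only the `.py` module is reported; the output directory has no `__init__.py`, so it is not treated as a package.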
 ### D4. GEMM heavy sweep — fresh by default, `MILESTONE_FAST` to reuse
 `milestone-1h-gemm` runs the full 24-sim sweep by default (minutes; one
 shape is 2048 tiles). `MILESTONE_FAST=1` reuses the committed
 `docs/diagrams/gemm_sweep.json` and only re-renders (seconds). This
 reverses ADR-0044 D1/D2's "heavy sweep stays a manual/`slow`-marked step":
 running the bench *is* the regeneration. In the test suite, the slow path
 is exercised by a `@pytest.mark.slow` bench test, while the fast-path test
 runs by default.
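 One way the switch could be read is sketched below. The helper names and the strict `== "1"` comparison are assumptions; the text only states that `MILESTONE_FAST=1` selects the reuse path.

```python
import os


def use_cached_sweep(environ=None) -> bool:
    """Hypothetical check: True only when MILESTONE_FAST is exactly '1'."""
    env = os.environ if environ is None else environ
    return env.get("MILESTONE_FAST") == "1"


def sweep_mode(environ=None) -> str:
    # "reuse": re-render from the committed JSON; "fresh": run the full sweep.
    return "reuse" if use_cached_sweep(environ) else "fresh"


modes = [
    sweep_mode({}),
    sweep_mode({"MILESTONE_FAST": "1"}),
    sweep_mode({"MILESTONE_FAST": "0"}),
]
```

 Passing the environment as a parameter keeps the decision testable without mutating `os.environ`.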
 ### D5. Tests + script reuse via thin re-export shims (single home kept)
 The pre-existing figure tests and the `scripts/gemm_sweep.py` entry point
 are retained and now reuse the bench modules:
 - `tests/gemm/_gemm_plot_helpers.py` → re-exports the renderers +
  `GEMM_SWEEP_JSON`/`GEMM_PLOTS_DIR`/`ROOT` from
  `kernbench.benches.milestone_1h_gemm`.
 - `tests/sccl/_allreduce_helpers.py` → re-exports the driver core, config
  writers, sweep constants, renderers, and disk aggregators from
  `kernbench.benches.milestone_1h_ccl`, and keeps the **pytest-only** pieces
  local: the `pytest.param` matrices (`CONFIGS` / `_sweep_params` /
  `_bk_params`) and the fixture-coupled `_run_distributed`
  (`monkeypatch.chdir` + `_drive_distributed`) wrapper.
 - `scripts/gemm_sweep.py` → thin wrapper over the bench's `run_sweep`.
 Tests importing a bench module is permitted (tests sit above production,
 ADR-0007); it triggers the whole-package eager audit, which already runs on
 every `kernbench` invocation. matplotlib stays lazily imported inside the
 renderers, so the audit's startup cost is unchanged.
 ### D6. Flat module naming (no `benches/` subfolder)
 A `benches/` subpackage named `1H_milestone…` is impossible — a Python
 package name cannot start with a digit. The benches are therefore flat
 modules `milestone_1h_gemm.py` / `milestone_1h_ccl.py` with bench names
 `milestone-1h-gemm` / `milestone-1h-ccl` (kebab-case, letter-first per
 ADR-0045 D1).
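 The naming constraint can be verified with `str.isidentifier`. The digit-first name used here is the output directory name from D3, chosen only as an illustration of a name that starts with a digit.

```python
# A name can appear in an `import` statement only if it is a valid identifier.
names = ["1H_milestone_output", "milestone_1h_gemm", "milestone_1h_ccl"]
importable = {name: name.isidentifier() for name in names}
```

 The digit-first name is rejected, while both flat module names are valid identifiers.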
 ## Consequences
 ### Positive
 - `kernbench run --bench milestone-1h-gemm` (or `…-ccl`) regenerates all of
  a domain's results + figures in one command — the milestone requirement.
 - Single source for the eval logic (the bench), reused by tests and the
  script via shims; no duplication.
 - The figure tests and `scripts/gemm_sweep.py` keep working unchanged.
 ### Negative / limitations
 - The two bench files are large (the CCL one mixes the distributed driver,
  sweeps, and matplotlib drawing). A "bench" that is mostly an eval harness
  is unusual; this ADR legitimizes it.
 - Generated artifacts live inside the source tree (`src/kernbench/benches/`)
  by explicit request and are committed (so the figures are viewable on the
  remote); rerunning the bench regenerates them.
 - `milestone-1h-ccl` (and the default `milestone-1h-gemm`) take minutes —
  acceptable for an on-demand milestone artifact, not for routine runs.
 ## Dependencies
 - **ADR-0007**: layer direction (why tests may import production but a bench
  may not import tests).
 - **ADR-0043 / ADR-0044**: the allreduce / GEMM eval harnesses this ADR
  relocates into benches.
 - **ADR-0045**: bench module contract; D2 here extends its D5 (single-device
  rule) with the eval-bench pattern, and relies on D4 (NO_REQUESTS) for the
  sentinel.
 - **ADR-0024**: rank = SIP launcher driven by the allreduce sweeps.
 ## Open questions
 - Should the GEMM theoretical-model constants (ADR-0044 D5) be sourced from
  ADR-0033/0014 rather than copied? Unchanged by this ADR.
 - Should `build_overview_slides.py` consume the milestone output PNGs
  instead of drawing GEMM bars natively? Still open (ADR-0044 D6 / Negative).
@@ -0,0 +1,175 @@
 # ADR Index
 Auto-generated by `tools/generate_adr_index.py`. Total ADRs: **47**.
 Classification mirrors the `/report` skill's section assignment. When adding a new ADR, also add an entry to the `CLASSIFICATION` table in `tools/generate_adr_index.py`.
 ## Design Principles
 - [ADR-0013](./ADR-0013-ver-verification-strategy.md) — Verification Strategy and Phase 1 Test Plan
 - [ADR-0033](./ADR-0033-lat-latency-model-assumptions.md) — Latency Model: Assumptions and Known Simplifications
 ## High-level Architecture
 - [ADR-0003](./ADR-0003-dev-target-system-hierarchy.md) — Target System Hierarchy & Modeling Scope  _(System hierarchy (Tray / SIP / CUBE / PE))_
 - [ADR-0007](./ADR-0007-api-runtime-api-boundaries.md) — Runtime API and Simulation Engine Boundaries  _(Runtime API ↔ sim_engine boundaries)_
 - [ADR-0016](./ADR-0016-dev-iochiplet-noc-and-memory-path.md) — IOChiplet NOC and Memory Data Path  _(IOChiplet NOC and memory data path)_
 - [ADR-0017](./ADR-0017-dev-cube-noc-and-hbm-connectivity.md) — Cube NOC and HBM Connectivity  _(Cube NOC and HBM connectivity)_
 ## Detailed Architecture
 One subsection per component file under `src/kernbench/components/builtin/`.
 ### forwarding
 - [ADR-0037](./ADR-0037-dev-forwarding-component.md) — Forwarding Component (forwarding_v1)
 ### hbm_ctrl
 - [ADR-0034](./ADR-0034-dev-hbm-controller-internal-design.md) — HBM Controller Internal Design
 ### io_cpu
 - [ADR-0036](./ADR-0036-dev-io-cpu-component-model.md) — IO_CPU Component Model
 ### m_cpu
 - [ADR-0035](./ADR-0035-dev-m-cpu-and-m-cpu-dma-component-model.md) — M_CPU and M_CPU.DMA Component Model
 ### pcie_ep
 - [ADR-0038](./ADR-0038-dev-pcie-ep-component-model.md) — PCIE_EP Component Model
 ### pe_cpu
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 ### pe_dma
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 - [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
 ### pe_fetch_store
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 ### pe_gemm
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 ### pe_ipcq
 - [ADR-0023](./ADR-0023-dev-ipcq-pe-collective.md) — PE-level IPCQ — Inter-PE Collective Communication
 ### pe_math
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 ### pe_mmu
 - [ADR-0039](./ADR-0039-dev-pe-mmu-component-model.md) — PE_MMU Component Model — Component + Utility Dual Role
 ### pe_scheduler
 - [ADR-0014](./ADR-0014-dev-pe-pipeline-execution-model.md) — PE Pipeline Execution Model
 ### pe_tcm
 - [ADR-0040](./ADR-0040-dev-pe-tcm-component-model.md) — PE_TCM Component Model — Dual-Channel BW Serialization
 ### sram
 - [ADR-0041](./ADR-0041-dev-cube-sram-component-model.md) — Cube SRAM Component Model — terminal scratchpad on cube NoC
 ### tiling
 - [ADR-0042](./ADR-0042-prog-tile-plan-generators.md) — Tile Plan Generators — GEMM/Math Pipeline Plan Builders
 ## Implementation Decisions
 ### Address Scheme
 - [ADR-0001](./ADR-0001-mem-physaddr-layout.md) — 51-bit Physical Address Layout & Decoding Contract
 - [ADR-0011](./ADR-0011-mem-memory-addressing-simplification.md) — Memory Addressing — PA / VA / LA Address Models
 ### Routing & Helper API
 - [ADR-0002](./ADR-0002-lat-routing-distance.md) — Routing Distance, Ordering & Bypass Rules
 - [ADR-0051](./ADR-0051-lat-routing-helper-api.md) — Routing Helper API — `AddressResolver` + `PathRouter`
 ### Memory Semantics & Local-HBM Bandwidth
 - [ADR-0004](./ADR-0004-mem-memory-semantics-local-hbm.md) — Memory Semantics & Local-HBM Bandwidth Guarantee
 ### Topology Compilation, Diagrams & Builder Algorithms
 - [ADR-0005](./ADR-0005-dev-diagram-views-distance-layout.md) — Diagram Views & Distance-Aware Layout Rules
 - [ADR-0006](./ADR-0006-dev-topology-compilation-distance-diagram.md) — Topology Compilation, Distance Extraction, and Automatic Diagram Generation
 - [ADR-0053](./ADR-0053-dev-topology-builder-algorithms.md) — Topology Builder + Visualizer Algorithms
 ### Tensor Deployment and Allocation
 - [ADR-0008](./ADR-0008-api-tensor-deploy-and-allocation.md) — Tensor Deployment and Allocation (Host Allocator, PA-first)
 ### Kernel Execution and Host-Device Messaging
 - [ADR-0009](./ADR-0009-api-kernel-execution-messaging.md) — Kernel Execution Messaging and Completion Semantics
 - [ADR-0012](./ADR-0012-api-host-io-message-schema.md) — Host ↔ IO_CPU Message Schema (PA-first, PE-tagged)
 ### CLI Surface and Semantics
 - [ADR-0010](./ADR-0010-api-cli-surface-and-semantics.md) — Command Line Interface and Execution Semantics
 ### Component Port/Wire Fabric Model
 - [ADR-0015](./ADR-0015-dev-component-port-wire-model.md) — Component Port/Wire Model and Fabric Routing
 ### Two-Pass Data Execution
 - [ADR-0020](./ADR-0020-prog-data-execution-two-pass.md) — 2-Pass Data Execution Model (Timing / Data Separation)
 ### 2D Grid Program Identity
 - [ADR-0022](./ADR-0022-prog-program-id-2d-grid.md) — 2D Grid program_id Semantics
 ### Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)
 - [ADR-0024](./ADR-0024-par-sip-tp-launcher.md) — SIP-level Launcher — rank = SIP
 - [ADR-0026](./ADR-0026-par-dppolicy-intra-device.md) — DPPolicy = Intra-Device Only — remove sip/num_sips fields
 - [ADR-0027](./ADR-0027-par-megatron-tp.md) — Megatron-style Tensor Parallelism API
 - [ADR-0047](./ADR-0047-par-ahbm-ccl-backend.md) — AHBM CCL Backend — `torch.distributed`-compat shim
 - [ADR-0050](./ADR-0050-par-ccl-algorithm-module-contract.md) — CCL Algorithm Module Contract — `ccl/algorithms/*.py`
 ### IPCQ Direction Addressing
 - [ADR-0025](./ADR-0025-algo-ipcq-direction-addressing.md) — IPCQ Direction Addressing — address-based matching
 ### Intercube All-Reduce
 - [ADR-0032](./ADR-0032-algo-intercube-allreduce.md) — Intercube All-Reduce — pe0 cube-mesh reduce + multi-SIP exchange
 ### Evaluation Harnesses
 - [ADR-0043](./ADR-0043-eval-allreduce-harness.md) — Allreduce Evaluation Harness — `tests/sccl/`
 - [ADR-0044](./ADR-0044-eval-gemm-harness.md) — GEMM Evaluation Harness — `scripts/gemm_sweep.py` + `tests/gemm/`
 - [ADR-0054](./ADR-0054-eval-milestone-benches.md) — Milestone Eval Benches — self-contained sweep + figure benches
 ### Bench Module Contract
 - [ADR-0045](./ADR-0045-prog-bench-module-contract.md) — Bench Module Contract — registration, dispatch, and authoring
 ### Kernel-side tl.* API (TLContext)
 - [ADR-0046](./ADR-0046-prog-tl-context-contract.md) — TLContext — Kernel-side `tl.*` API Contract
 ### Memory Allocator Algorithms
 - [ADR-0048](./ADR-0048-mem-allocator-algorithms.md) — Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
 ### Probe Subcommand
 - [ADR-0049](./ADR-0049-ver-probe-subcommand.md) — `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
 ### Sim-engine Op Log and Memory Store Schemas
 - [ADR-0052](./ADR-0052-dev-oplog-memory-store-schemas.md) — OpLog + MemoryStore Schemas — sim_engine internals
@@ -0,0 +1,836 @@
 # KernBench — Architecture Design Document
 *2026 1H*
 KernBench is a system-level, discrete-event simulator for AI-accelerator
 chiplet systems. It models the data-movement and control paths across
 the full hardware hierarchy and reports end-to-end execution latency
 for kernels dispatched to the device's compute units.
 This document is a public summary of the architecture as designed and
 implemented in the first half of 2026. It assumes no prior knowledge of
 the simulator's internal documents; terms specific to the system are
 defined on first use.
 ---
 ## Design Principles
 KernBench is grounded in two foundational commitments: every measured
 latency must trace to explicit, modeled events on the simulator's graph,
 and every behavioral claim must be verifiable through tests that target
 spec-level invariants rather than incidental implementation details.
 <!-- src: ADR-0013 Context, Decision -->
 The development posture is verification-driven. Tests are written to
 validate the architectural contracts that the simulator exposes —
 correct routing, deterministic results, monotonic latency under
 increasing hop counts — rather than to mirror the call graph of the
 implementation. Two phases coexist: a fast timing phase that exercises
 the simulator's discrete-event engine and produces a log of operations
 with timestamps, and an optional data-replay phase that uses that log
 to compute real numerical results. Tests can target either phase.
 <!-- src: ADR-0033 Context, Decision -->
 The latency model is intentionally abstract rather than
 cycle-accurate. Each modeled node contributes a configurable per-node
 overhead, each link contributes wire delay plus byte-over-bandwidth
 serialization, and each terminal service contributes its own service
 time. The simulator does not attempt to reproduce cache coherence
 protocols, microarchitectural pipelines, or full PCIe/UCIe protocol
 correctness; those are explicitly outside the scope. The aim is a
 simulator that compares system-level configurations meaningfully and
 deterministically, not one that reproduces microarchitectural detail.
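The three contributions named above can be summed along a path. This is a minimal sketch under stated assumptions, not the simulator's implementation: the `Hop` type, its field names, and the units (nanoseconds, gigabytes per second) are all chosen here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    node_overhead_ns: float    # configurable per-node overhead
    wire_delay_ns: float       # wire delay of the link leaving the node
    bandwidth_gb_per_s: float  # link bandwidth (assumed unit)


def path_latency_ns(hops: list[Hop], size_bytes: int, service_ns: float) -> float:
    """Sum per-node overhead, wire delay, and byte-over-bandwidth serialization,
    then add the terminal's service time."""
    total = service_ns
    for hop in hops:
        # 1 GB/s equals 1 byte/ns, so bytes / (GB/s) is already in nanoseconds.
        serialization_ns = size_bytes / hop.bandwidth_gb_per_s
        total += hop.node_overhead_ns + hop.wire_delay_ns + serialization_ns
    return total


example_hops = [Hop(2.0, 1.0, 64.0), Hop(3.0, 0.5, 32.0)]
example_latency = path_latency_ns(example_hops, size_bytes=256, service_ns=10.0)
```

Because every term is an explicit parameter, two configurations can be compared by changing one field at a time.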
 <!-- src: ADR-0033 Decision, Consequences -->
 Determinism is a hard requirement. Given identical inputs — topology,
 routing policy, and request stream — the simulator must produce
 identical outputs, hop traces included. This rules out reliance on
 unordered set iteration on the critical path and forces every latency
 contribution to come from an explicitly scheduled event on a modeled
 component or link. There are no implicit waits, no hardcoded magic
 delays, and no shortcuts that bypass the modeled graph.
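One common way to keep a discrete-event queue deterministic is to break timestamp ties with a monotonically increasing sequence number. The sketch below shows that technique in isolation; the `EventQueue` class is hypothetical and is not taken from the simulator's engine.

```python
import heapq
import itertools


class EventQueue:
    """Event queue that orders by time, then by scheduling order (illustrative)."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._seq = itertools.count()

    def schedule(self, time_ns: float, label: str) -> None:
        # The sequence number makes ordering independent of label contents
        # and of any hash-based iteration order.
        heapq.heappush(self._heap, (time_ns, next(self._seq), label))

    def drain(self) -> list[tuple[float, str]]:
        order = []
        while self._heap:
            time_ns, _, label = heapq.heappop(self._heap)
            order.append((time_ns, label))
        return order


queue = EventQueue()
queue.schedule(5.0, "b")
queue.schedule(5.0, "a")
queue.schedule(1.0, "c")
drained = queue.drain()
```

Events at the same timestamp come out in the order they were scheduled, so repeated runs with the same request stream yield the same trace.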
 ---
 ## High-level Architecture
 <!-- src: ADR-0003 Context, Decision -->
 The simulated system is a four-level hierarchy. A **Tray** holds one or
 more **SIPs** (system-in-package), each containing a 2D mesh of
 **CUBEs** plus one or more **IO chiplets** that connect the SIP to the
 host. Each CUBE contains a regular grid of **PEs** (processing
 elements) plus its own attached resources — high-bandwidth memory
 (HBM), a per-cube SRAM scratchpad, and a management CPU (M_CPU). The PE
 itself is a composite of nine sub-components rather than a monolithic
 core. This hierarchy is fixed; the parameters along each axis (counts,
 mesh dimensions, link widths) are configurable through the topology
 spec.
 <!-- src: ADR-0007 Context, Decision -->
 A clean separation runs along the request flow. A **runtime API** at
 the top is the host-facing surface; it exposes tensor and kernel
 operations, owns host-side allocation metadata, and is
 topology-agnostic: it does not route or fan out. Below it, the **simulation
 engine** decomposes runtime operations into discrete graph requests
 (memory writes, memory reads, kernel launches, MMU map installs) and
 schedules events deterministically. At the bottom, **components** model
 device behavior on a graph of nodes connected by links; they
 implement the actual latency contributions and pass requests along.
 No component reaches up into the runtime API, and no runtime call
 shortcuts the engine.
 <!-- DIAGRAM: Four-level system hierarchy — Tray containing SIPs, each SIP showing its 2D cube mesh and IO chiplet; one cube blown up to show the router mesh, attached PEs, M_CPU, SRAM, and HBM partition. -->
 ### Tray
 <!-- src: ADR-0003 Decision -->
 The Tray is the outermost boundary. It owns the host CPU on one side
 and one or more SIPs on the other, connected through a fabric switch.
 For collective communication that must traverse multiple SIPs, the
 fabric switch acts as the common rendezvous: device-side outbound
 traffic from one SIP routes through the switch and back into the
 target SIP's IO chiplet.
 ### SIP
 <!-- src: ADR-0003 Decision, ADR-0017 Context -->
 A SIP packages a 2D mesh of CUBEs and one or more IO chiplets. The
 default topology used by the simulator is a 4×4 cube mesh; the
 mesh dimensions are configurable. Each cube connects to its mesh
 neighbors over UCIe (die-to-die) links arranged on its four cardinal
 sides: north, south, east, and west. The IO
 chiplets sit on one side of the SIP and provide the bridge to the host
 across PCIe.
 <!-- src: ADR-0016 Context, Decision -->
 The IO chiplet itself contains its own internal network. A
 host-facing PCIe endpoint passes traffic to a small NOC ("network on
 chip"); from there it can branch to a control-plane CPU that processes
 kernel-launch messages, or it can take the direct memory data path to
 the cube's HBM controller. The direct memory path bypasses the control
 CPU by design, so that host-issued memory writes do not pay
 control-plane overhead on the data path.
 ### CUBE
 <!-- src: ADR-0017 Decision -->
 Each CUBE owns a 2D mesh of NOC routers and a set of attached
 resources: PEs, the cube-local SRAM scratchpad, the management CPU
 (M_CPU), and the HBM partition (split across multiple PE-private
 slices for bandwidth). The router mesh uses deterministic XY routing.
 Attached components do not connect to each other directly — they all
 sit on the router mesh, and every cube-internal transfer pays the
 mesh distance from source to destination.
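Deterministic XY routing can be sketched as follows. The sketch assumes the conventional X-first ordering and integer `(x, y)` router coordinates; neither detail is stated in this document, and `xy_route` is a hypothetical name.

```python
def xy_route(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple[int, int]]:
    """Return the routers visited from src to dst, moving along X first, then Y."""
    x, y = src
    path = [(x, y)]
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step_x
        path.append((x, y))
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step_y
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
hop_count = len(route) - 1  # equals the Manhattan distance between the routers
```

The same source and destination always produce the same path, which is what makes the mesh distance a stable quantity to charge each transfer.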
 <!-- src: ADR-0017 Decision -->
 The HBM partition is per-PE: each PE owns one HBM slice, and the
 controller exposes per-PE channels so that the same PE always
 addresses the same set of HBM channels. This makes the local-HBM
 bandwidth from a PE to its own slice predictable, while accesses to
 another PE's slice — or a different cube's slice — pay the mesh
 distance and any UCIe crossings.
 ### PE
 <!-- src: ADR-0014 Context, Decision -->
 A PE is not a monolithic core. Internally it is a set of nine
 sub-components, each modeling one stage of a request's flow: a small
 control CPU, a tile-pipeline scheduler, a DMA engine, a fetch-store
 engine that moves data between the on-PE scratchpad and the register
 file, a GEMM compute engine, a math compute engine, the
 tightly-coupled memory (TCM, the on-PE scratchpad), an MMU for
 virtual-to-physical address translation, and an inter-PE collective
 queue (IPCQ). The scheduler decomposes higher-level operations into
 per-tile stage sequences, and tile tokens self-route from one
 sub-component to the next.
 <!-- DIAGRAM: PE internal layout — the nine sub-components and the edges that connect them; tile token flowing through DMA_READ → FETCH → GEMM → STORE → DMA_WRITE. -->
 ---
 ## Detailed Architecture
 This section describes each modeled device-side component in turn.
 Components are listed in the alphabetical order used by the
 simulator's source tree.
 ### forwarding
 <!-- src: ADR-0037 Context, Decision -->
 The forwarding component is the generic routing relay used wherever a
 node only needs to apply a small processing overhead and pass the
 request to the next hop. NOC routers, conn nodes, and UCIe PHYs all
 reduce to this. Its first act on receiving a request is to apply the
 per-node overhead configured for it in the topology spec; after the
 overhead it simply hands the request to the next hop along the path.
 <!-- src: ADR-0037 Decision, Consequences -->
 The decision to share one implementation across these roles was made
 to keep the simulator's component set small without sacrificing
 modeling fidelity. Each instance still carries its own overhead and
 its own link bandwidth contributions, so different roles still produce
 different timing. What is shared is the dispatcher loop, not the
 parameter values.
 ### hbm_ctrl
 <!-- src: ADR-0034 Context, Decision -->
 The HBM controller is the terminal node for all memory traffic that
 reaches HBM. Internally it owns a number of pseudo channels, partitioned
 per-PE so that each PE addresses a deterministic subset. On a request
 arrival the controller first selects the right pseudo channel from the
 target address, then enters a chunk-loop that drains the requested
 size in fixed-size flits over the channel's bandwidth.
 <!-- src: ADR-0034 Decision, Consequences -->
 The chunk-loop pattern replaces an earlier all-at-once drain. The
 benefit is that the controller no longer presents a flit-aware fabric
 with a single bulk transfer; instead it emits flits at a paced rate
 matching the channel bandwidth, which makes cross-flow contention
 visible. The bandwidth budget is calibrated against the configured
 HBM total bandwidth divided across the channel count.
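The channel selection and the paced chunk-loop can be sketched as below. This is a sketch under assumptions: the interleaving rule in `select_channel`, the parameter names, and the units are hypothetical, and only the overall shape (pick a per-PE channel, then drain in fixed-size flits at total bandwidth divided by channel count) follows the description above.

```python
def select_channel(pe_id: int, offset: int, channels_per_pe: int, interleave_bytes: int) -> int:
    """Pick a pseudo channel inside the PE's own subset (hypothetical interleaving)."""
    local = (offset // interleave_bytes) % channels_per_pe
    return pe_id * channels_per_pe + local


def drain_times_ns(size_bytes: int, flit_bytes: int,
                   total_bw_gb_per_s: float, num_channels: int) -> list[float]:
    """Return the completion time of each flit when draining one request."""
    channel_bw = total_bw_gb_per_s / num_channels  # GB/s, which equals bytes per ns
    times = []
    now = 0.0
    remaining = size_bytes
    while remaining > 0:
        chunk = min(flit_bytes, remaining)  # the last flit may be partial
        now += chunk / channel_bw
        times.append(now)
        remaining -= chunk
    return times


flit_times = drain_times_ns(size_bytes=160, flit_bytes=64,
                            total_bw_gb_per_s=128.0, num_channels=8)
channel = select_channel(pe_id=2, offset=128, channels_per_pe=4, interleave_bytes=64)
```

Emitting one completion time per flit, rather than one for the whole request, is what lets a competing flow interleave with the drain.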
 ### io_cpu
 <!-- src: ADR-0036 Context, Decision -->
 The IO_CPU is the control-plane processor sitting inside the IO chiplet.
 It receives kernel-launch messages from the host, decodes them, and
 dispatches per-cube launches to the cube's management CPU. Pure memory
 operations bypass it entirely, taking the direct data path established
 inside the IO chiplet.
 <!-- src: ADR-0036 Decision -->
 On receiving a kernel-launch message, the IO_CPU consults the message's
 shard list — which already names the target SIP, cube, and PE for each
 piece of the tensor argument — and forwards a per-cube launch to each
 cube the kernel needs to reach. This makes the IO_CPU a deterministic
 fan-out point: it does not decode physical addresses to route, it just
 follows the explicit per-shard targets it was handed.
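The fan-out described above amounts to grouping shards by their explicit targets. A minimal sketch, assuming a hypothetical `Shard` record with `sip`, `cube`, and `pe` fields; the real message schema is not shown in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    sip: int
    cube: int
    pe: int


def fan_out(shards: list[Shard]) -> dict[tuple[int, int], list[int]]:
    """Group target PEs by (sip, cube), preserving first-seen order."""
    launches: dict[tuple[int, int], list[int]] = {}
    for shard in shards:
        pes = launches.setdefault((shard.sip, shard.cube), [])
        if shard.pe not in pes:
            pes.append(shard.pe)  # one launch entry per PE, no duplicates
    return launches


per_cube = fan_out([Shard(0, 1, 3), Shard(0, 0, 0), Shard(0, 1, 4), Shard(0, 1, 3)])
```

No address decoding appears anywhere in the function; the routing information comes entirely from the shard list, and dictionary insertion order keeps the fan-out order repeatable.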
 ### m_cpu
 <!-- src: ADR-0035 Context, Decision -->
 The M_CPU is the cube's management processor. It owns two distinct
 roles: as a control-plane fan-out point for kernel launches arriving
 from the IO chiplet, and as a DMA endpoint for host-initiated memory
 writes that need to land in this cube's HBM. The control role
 forwards launches to the right PE control CPUs; the DMA role places
 the actual bytes into HBM through the router mesh.
 <!-- src: ADR-0035 Decision -->
 The component model deliberately distinguishes the two roles because
 their routing differs: the control fan-out path uses command-kind
 links that do not appear on data-path routes, while the DMA path uses
 the same router mesh as PE-initiated DMA, with PE-internal nodes
 excluded. The routing layer knows about both modes and selects the
 appropriate adjacency at request time.
 ### pcie_ep
 <!-- src: ADR-0038 Context, Decision -->
 The PCIE endpoint is the protocol boundary at the host-device edge.
 Its first act on each incoming request is to apply a configured
 protocol-processing overhead; after that it simply forwards. There is
 no internal queuing model, no retry, and no TLP-level fidelity — those
 are deliberately outside scope. The endpoint is bidirectional: host →
 device traffic (memory writes, kernel launches) flows one way, and
 device-side outbound traffic (cross-SIP collective sends) flows the
 other.
 <!-- src: ADR-0038 Decision, Alternatives Considered, Consequences -->
 A more detailed PCIe model was considered and rejected. The simulator
 targets system-level latency comparisons; making the endpoint
 heavier with credit-management and retry logic would not improve the
 metrics being studied. The decision keeps the endpoint as the
 documented protocol-boundary node, named consistently so routing
 helpers can locate it by SIP and IO instance.
 ### pe_cpu
 <!-- src: ADR-0014 Decision -->
 The PE control CPU is the entry point for kernel work arriving from
 the cube's management CPU. It receives kernel-launch messages, resolves
 the kernel function by name, and hands execution to the scheduler with
 the resolved tensor arguments. From the scheduler's point of view, the
 PE_CPU is the upstream source of high-level commands; from the rest
 of the system's point of view, the PE_CPU is where a kernel's
 execution begins on a given PE.
 ### pe_dma
 <!-- src: ADR-0014 Decision, ADR-0023 Decision -->
 The DMA engine on each PE has two distinct modes. In the standard PE
 pipeline it consumes tile tokens issued by the scheduler, acquires a
 read or write channel (modeled as a one-in-flight resource per
 direction), and runs the bytes to or from HBM through the mesh. In
 its collective mode it forwards send tokens for the PE's IPCQ into
 the fabric, snapshotting the source data at send time so later
 mutations cannot race the receiver's read. Both modes share the same
 channel resources but differ in their downstream handling — one
 returns when the round-trip completes, the other dispatches
 fire-and-forget.
 ### pe_fetch_store
 <!-- src: ADR-0014 Decision -->
 The fetch-store engine is the bridge between the on-PE scratchpad
 (TCM) and the register file. It does not run DMA; it only moves bytes
 internally. On receiving a tile-stage token it sends a short request
 to the TCM, waits for the bandwidth-serialized delay, and continues
 the pipeline. The split between this engine and the TCM lets the
 scratchpad model its own read/write bandwidth independently.
 ### pe_gemm
 <!-- src: ADR-0014 Decision -->
 The GEMM engine is the matrix-multiply compute unit. Tile tokens
 arriving at this stage carry the per-tile dimensions, and the engine
 contributes a service time that accounts for one fused multiply-add per
 MAC in the tile. Composite operations (where the same tensor pair is
 streamed across many tiles) reuse the engine through the scheduler;
 the engine itself is stateless between tiles.
 ### pe_ipcq
 <!-- src: ADR-0023 Context, Decision -->
 The IPCQ, the inter-PE collective queue, is each PE's
 collective-communication endpoint. It owns ring buffers that hold
 inbound messages from neighbor PEs and bookkeeping for send credits.
 Direction names ("N", "S", "E", "W" for cube-internal neighbors and
 "global_*" for cross-SIP neighbors) are resolved to physical peer
 endpoints by a neighbor table installed at process-group creation
 time. The component itself does not move bytes — it issues DMA tokens
 through the local PE_DMA, which performs the actual cross-PE
 transfer.
 <!-- src: ADR-0023 Decision, Consequences -->
 A key invariant is that the inbound terminal — where data lands at
 the receiver — pays the link bandwidth drain plus any cube-internal
 mesh hop to the slot's backing memory. This prevents IPCQ from
 silently outpacing raw DMA at large transfer sizes. Outbound sends
 are fire-and-forget; credit return is the only backpressure signal.
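Credit-based backpressure with a send-time snapshot can be sketched as follows. The `IpcqLink` class is hypothetical and models only the bookkeeping (credits and a ring of slots); it carries no timing and is not the component's actual interface.

```python
from collections import deque


class IpcqLink:
    """One direction of an inter-PE queue with credit backpressure (illustrative)."""

    def __init__(self, slots: int) -> None:
        self.credits = slots               # one credit per free receiver slot
        self.ring: deque[bytes] = deque()  # inbound ring buffer at the receiver

    def send(self, payload) -> bool:
        if self.credits == 0:
            return False                   # no credit: the sender must retry later
        self.credits -= 1
        # Snapshot the data at send time so later mutation of the source
        # buffer cannot change what the receiver reads.
        self.ring.append(bytes(payload))
        return True

    def recv(self) -> bytes:
        data = self.ring.popleft()
        self.credits += 1                  # credit return is the only backpressure signal
        return data


link = IpcqLink(slots=2)
source = bytearray(b"ab")
first_ok = link.send(source)
source[0] = ord("z")                       # mutate after the send
second_ok = link.send(b"cd")
third_ok = link.send(b"ef")                # refused: both slots are in use
received = link.recv()
```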
 ### pe_math
 <!-- src: ADR-0014 Decision -->
 The math engine handles element-wise and reduction operations. It
 consumes tile tokens carrying an operation kind (`exp`, `sum`, `max`,
 `where`, etc.) and contributes a service time proportional to the
 number of elements processed. Like the GEMM engine it is stateless;
 chained epilogues (a sequence of math operations after a GEMM tile)
 are scheduled as separate stages.
 ### pe_mmu
 <!-- src: ADR-0039 Context, Decision -->
 The MMU has two roles, exposed through one component. As a node on
 the cube NOC it receives MMU-map and MMU-unmap messages and updates
 its internal page table, so that the runtime API can install
 virtual-to-physical mappings with measured fabric latency. As a
 utility object held inside the PE it offers synchronous translate
 calls to the PE's DMA and GEMM engines without taking simulator time
 itself; the calling engine pays any configured TLB overhead in its
 own process.
 <!-- src: ADR-0039 Decision, Alternatives Considered -->
 The page table supports multiple disjoint regions inside a single
 page, with later-write-wins semantics on overlap. This is a deliberate
 simulator stopgap: it supports parallelization policies that shard
 data at sub-page granularity, which a real hardware MMU's
 one-PA-per-entry assumption would silently mis-route. A real MMU does
 not work this way; the model documents this as a simplification.
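The region semantics can be sketched with a small table that searches the newest mapping first. The `PageTable` class and its method names are hypothetical; the pass-through for unmapped addresses is also an assumption made for this example.

```python
class PageTable:
    """Region-granular translation table with later-write-wins overlap (illustrative)."""

    def __init__(self) -> None:
        # Each entry is (va_start, size, pa_start), kept in install order.
        self._regions: list[tuple[int, int, int]] = []

    def map(self, va_start: int, size: int, pa_start: int) -> None:
        self._regions.append((va_start, size, pa_start))

    def translate(self, va: int) -> int:
        # Search newest first so a later mapping wins where regions overlap.
        for va_start, size, pa_start in reversed(self._regions):
            if va_start <= va < va_start + size:
                return pa_start + (va - va_start)
        return va  # assumed fallback: treat an unmapped address as physical


table = PageTable()
table.map(0x1000, 0x800, 0xA000)  # two disjoint regions that would
table.map(0x1800, 0x800, 0xB000)  # share a single 4 KiB page
low_half = table.translate(0x1010)
high_half = table.translate(0x1810)
table.map(0x1000, 0x100, 0xC000)  # overlaps the first region
overlapped = table.translate(0x1010)
```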
 ### pe_scheduler
 <!-- src: ADR-0014 Decision -->
 The scheduler is the sole dispatcher inside a PE. Simple commands are
 routed directly to the right engine. Composite commands generate a
 tile plan, and the resulting tile tokens are fed into the pipeline.
 Self-routing keeps the scheduler off the per-stage hot path: each
 engine, on finishing a stage, advances the token to the next stage's
 component itself, so the scheduler only does initial dispatch and
 completion tracking.
 ### pe_tcm
 <!-- src: ADR-0040 Context, Decision -->
 The TCM is the per-PE tightly-coupled scratchpad memory. It models
 time only, not data — the actual payload lives in the simulator's
 memory store. Read and write are independent channels: each is
 modeled as a one-in-flight resource, so same-direction requests
 serialize but a read and a write can overlap. The bandwidth of each
 direction is configured separately and applied as bytes-over-bandwidth
 on each request.
 <!-- src: ADR-0040 Decision, Alternatives Considered -->
 The decision to keep read and write on separate channels was made
 because the PE pipeline's normal case overlaps fetch (read) and store
 (write). Collapsing them into a single shared channel would have
 artificially serialized that overlap and produced an incorrect
 bandwidth ceiling.
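The effect of separate read and write channels can be seen in a small timing sketch. The `Tcm` class below is hypothetical: it tracks only when each channel becomes free and assumes bandwidth in bytes per nanosecond.

```python
class Tcm:
    """Timing-only scratchpad with independent read and write channels (illustrative)."""

    def __init__(self, read_bw: float, write_bw: float) -> None:
        self._bw = {"read": read_bw, "write": write_bw}  # bytes per ns (assumed unit)
        self._free_at = {"read": 0.0, "write": 0.0}      # when each channel is next idle

    def access(self, kind: str, now_ns: float, size_bytes: int) -> float:
        # One request in flight per direction: wait until the channel is free.
        start = max(now_ns, self._free_at[kind])
        done = start + size_bytes / self._bw[kind]
        self._free_at[kind] = done
        return done


tcm = Tcm(read_bw=64.0, write_bw=32.0)
read_1 = tcm.access("read", now_ns=0.0, size_bytes=128)
read_2 = tcm.access("read", now_ns=0.0, size_bytes=128)   # queues behind read_1
write_1 = tcm.access("write", now_ns=0.0, size_bytes=64)  # overlaps both reads
```

With a single shared channel, `write_1` would have finished after both reads instead of alongside the first one.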
 ### sram
 <!-- src: ADR-0041 Context, Decision -->
 The cube SRAM is a per-cube scratchpad attached to one of the cube's
 routers. As a node it applies a configured access overhead, pays the
 link-bandwidth drain stamped on the incoming request, and sends a
 response on the reverse path. It is a terminal — it does not forward.
 <!-- src: ADR-0041 Decision, Consequences -->
 A second role is as one of three backing-memory tiers (TCM, SRAM, HBM)
 that an inter-PE collective slot can live in. When the slot lives in
 SRAM, the PE_DMA pays the slot read or write latency directly using
 the configured SRAM bandwidth and overhead; the SRAM component does
 not need to know about collective semantics. This separation keeps
 the SRAM component agnostic to the collective subsystem.
 ### tiling
 <!-- src: ADR-0042 Context, Decision -->
 The tile-plan generator is not a runtime component — it is a pure
 module of functions that take a problem shape (matrix dimensions, tile
 sizes) and produce an ordered list of tile-stage sequences. The
 scheduler consumes this list. Each tile's stage sequence depends on
 how its operands are staged: operands streamed from HBM produce
 DMA_READ stages, operands already resident in TCM (because they were
 loaded eagerly upfront) skip them.
 <!-- src: ADR-0042 Decision, Consequences -->
 The plan generator is intentionally pure — given the same input it
 returns the same plan, with no simulator events created. This lets
 the rest of the system reason about tile sequences as data, and it
 makes the plan testable in isolation without simulator state. New
 plan variants (for example, K-major or DTensor-aware plans) can be
 added as new functions following the same shape.
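A pure plan generator of this shape can be sketched as below. The stage names follow the pipeline named in the PE diagram note (DMA_READ, FETCH, GEMM, STORE, DMA_WRITE); the function name, its parameters, and the row-major tile order are assumptions made for illustration.

```python
from itertools import product


def gemm_tile_plan(m: int, n: int, tile_m: int, tile_n: int, operands_resident: bool):
    """Return an ordered list of (tile_origin, stages) with no side effects."""
    compute = ["FETCH", "GEMM", "STORE", "DMA_WRITE"]
    # Operands already resident in TCM skip the DMA_READ stage.
    stages = compute if operands_resident else ["DMA_READ"] + compute
    plan = []
    for row, col in product(range(0, m, tile_m), range(0, n, tile_n)):
        plan.append(((row, col), tuple(stages)))
    return plan


streamed = gemm_tile_plan(64, 64, 32, 32, operands_resident=False)
resident = gemm_tile_plan(64, 64, 32, 32, operands_resident=True)
```

Because the result is plain data, a test can compare two plans directly without constructing any simulator state.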
 ---
 ## Implementation Decisions
 This section collects cross-cutting decisions — algorithms, policies,
 schemes, and contracts — that span multiple components rather than
 living inside one.
 ### Address Scheme
 <!-- src: ADR-0001 Context, Decision -->
 Every physical address in the simulator decodes into a structured
 location. A fixed-width physical address carries the SIP id, the
 cube id within the SIP, a type discriminator (HBM vs PE-resource vs
 others), and a type-specific offset. HBM addresses additionally encode
 the per-PE slice offset so the controller can determine which PE
 owns the target slice without external lookup. The layout is
 deliberately reserved rather than packed-to-fit, so new sub-units can
 be added at the type-discriminator level without rewriting existing
 addresses.
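Decoding a structured address is a matter of shifting and masking. The sketch below uses the 51-bit total from the ADR-0001 title, but the individual field widths are hypothetical and chosen only so that they sum to 51; the actual layout is defined in ADR-0001 and is not reproduced here.

```python
from typing import NamedTuple

# Hypothetical field widths, chosen only so that they sum to 51 bits.
SIP_BITS, CUBE_BITS, TYPE_BITS, OFFSET_BITS = 4, 5, 3, 39
TOTAL_BITS = SIP_BITS + CUBE_BITS + TYPE_BITS + OFFSET_BITS


class DecodedAddress(NamedTuple):
    sip: int
    cube: int
    kind: int    # type discriminator (for example HBM vs PE resource)
    offset: int  # type-specific offset


def encode(sip: int, cube: int, kind: int, offset: int) -> int:
    fields = ((sip, SIP_BITS), (cube, CUBE_BITS), (kind, TYPE_BITS), (offset, OFFSET_BITS))
    pa = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
        pa = (pa << width) | value  # pack fields from most to least significant
    return pa


def decode(pa: int) -> DecodedAddress:
    if not 0 <= pa < (1 << TOTAL_BITS):
        raise ValueError("address does not fit in the physical address width")
    offset = pa & ((1 << OFFSET_BITS) - 1)
    pa >>= OFFSET_BITS
    kind = pa & ((1 << TYPE_BITS) - 1)
    pa >>= TYPE_BITS
    cube = pa & ((1 << CUBE_BITS) - 1)
    sip = pa >> CUBE_BITS
    return DecodedAddress(sip, cube, kind, offset)


packed = encode(sip=3, cube=17, kind=1, offset=0x1234)
unpacked = decode(packed)
```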
 <!-- src: ADR-0011 Context, Decision -->
 On top of physical addressing, the simulator supports three address
 models that the runtime API selects between. Direct physical
 addressing is retained as a fallback. Virtual addressing — the
 current default — gives each tensor a contiguous virtual range at
 deployment, with the per-PE MMU translating per access; an
 alternative logical-address scheme remains a future option. The
 virtual-address path is what every current test path takes; the PA
 fallback is used by the MMU itself when no mapping exists for an
 address (a deliberate signal, not an error).
 <!-- src: ADR-0011 Decision, Consequences -->
 Tensor placement is represented as a list of physical-address shards,
 each tagged with target SIP, cube, and PE, plus a single tensor-wide
 virtual base. This means a kernel sees one virtual base for the whole
 tensor while the host driver and the engine still know exactly where
 each shard lives. Replicated tensors get per-cube local PA mappings;
 sharded tensors broadcast their mapping across cubes within a SIP.
 ### Routing, Distance & Helper API
 <!-- src: ADR-0002 Context, Decision -->
 Routing is policy-driven, deterministic, and topology-aware. Given a
 source, a destination, and an intent — for example, PE-initiated
 DMA versus host-initiated memory write versus a generic
 component-to-component query — the routing layer picks the right
 path. The intent matters because different traffic types must avoid
 different categories of edges: PE-initiated DMA should not traverse
 command-only links; M_CPU DMA should not pass through PE-internal
 pipeline edges; cube-local transfers should not use the
 zero-distance UCIe bus that would otherwise look attractive to a
 shortest-path search.
 <!-- src: ADR-0051 Decision -->
 The routing layer therefore maintains four separate adjacency graphs
 at construction, each excluding a different category of edges, and
 picks the appropriate one per intent. On top of the graphs sits a
 helper API that hides the topology's naming convention: callers ask
 for the PCIe endpoint of a given SIP, the M_CPU of a given cube, or
 the HBM destination for a given physical address, and receive the
 corresponding node id. No component constructs node-id strings
 directly; if the naming convention ever changes, the change is local
 to the helper layer.
 <!-- src: ADR-0051 Decision, Consequences -->
 Path-finding itself uses Dijkstra with explicit per-edge weights
 (routing weight is allowed to differ from physical distance — for
 example, UCIe is configured to be routing-preferable). Tie-breaks
 follow insertion order, which keeps results deterministic. Paths
 between unreachable nodes raise rather than returning empty, surfacing
 topology errors immediately.
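The path-finding behavior can be sketched with a standard heap-based Dijkstra. This is an approximation under assumptions: the adjacency format and the function name are hypothetical, and ties are broken by a push counter, which follows edge insertion order in the adjacency lists.

```python
import heapq
import itertools


def shortest_path(adjacency: dict[str, list[tuple[str, float]]], src: str, dst: str) -> list[str]:
    """Dijkstra over explicit edge weights; raises if dst is unreachable."""
    counter = itertools.count()
    heap = [(0.0, next(counter), src, [src])]
    settled = set()  # used for membership tests only, never iterated
    while heap:
        cost, _, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, weight in adjacency.get(node, []):
            if neighbor not in settled:
                # On equal cost, the entry pushed first is popped first.
                heapq.heappush(heap, (cost + weight, next(counter), neighbor, path + [neighbor]))
    raise ValueError(f"no route from {src} to {dst}")


graph = {
    "a": [("b", 1.0), ("c", 1.0)],
    "b": [("d", 1.0)],
    "c": [("d", 1.0)],
    "d": [],
}
chosen = shortest_path(graph, "a", "d")  # two equal-cost routes exist
```

Maintaining one such adjacency dictionary per intent, each built without the excluded edge category, would give the per-intent behavior described earlier.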
 ### Memory Semantics and Local-HBM Bandwidth
 <!-- src: ADR-0004 Context, Decision -->
 A PE accessing its own HBM slice through its own cube's NOC must see
 the full local HBM bandwidth — that is the model's intent. Memory
 traffic accumulates latency from per-component overhead and
 bytes-over-link-bandwidth serialization along the path, but the
 controller does not throttle below the slice's allotted bandwidth.
 Cross-PE-slice accesses inside the same cube, cross-cube accesses
 through UCIe, and cross-SIP accesses through PCIe each pay
 progressively more overhead as the path grows.
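
As a rough illustration of how latency accumulates along a path, the sketch below charges each hop its component overhead plus a bytes-over-bandwidth term. Charging serialization on every hop is an assumption made for the example, and the overhead and bandwidth numbers are invented, not taken from the model.

```python
def path_latency_ns(nbytes, hops):
    """Sum per-component overhead plus bytes-over-bandwidth serialization.

    hops: list of (overhead_ns, link_bandwidth_bytes_per_ns), one per hop.
    """
    total = 0.0
    for overhead_ns, bw_bytes_per_ns in hops:
        total += overhead_ns + nbytes / bw_bytes_per_ns
    return total


# Hypothetical figures: two local NOC hops, then one extra UCIe hop.
local_path = [(5.0, 64.0), (5.0, 64.0)]
cross_cube_path = local_path + [(20.0, 32.0)]

local_ns = path_latency_ns(4096, local_path)
cross_cube_ns = path_latency_ns(4096, cross_cube_path)
```

With these made-up numbers the cross-cube transfer costs more than the local one, matching the ordering described above.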
 ### Topology Compilation, Diagrams & Builder Algorithms
 <!-- src: ADR-0006 Context, Decision -->
 Topology is configurable, not hardcoded. The simulator reads a YAML
 spec, compiles it into a flat graph of nodes and edges plus four
 view projections at different abstraction levels — system, SIP, cube,
 PE — and uses the compiled graph as the single source for both
 execution and visualization. Distance metadata used by routing is
 extracted at compile time so that diagrams and routing decisions
 agree by construction.
 <!-- src: ADR-0005 Context, Decision -->
 Diagrams are derived artifacts of the compiled topology. The visualizer
 produces one SVG per view at the appropriate abstraction level; nothing
 in the diagrams is hand-drawn or hand-positioned. Distance-aware
 layout rules place nodes in the diagrams using the same coordinates
 that routing uses to compute distance, so a diagram that "looks
 wrong" is a signal that the topology itself has a problem, not the
 visualizer.
 <!-- src: ADR-0053 Decision -->
 Inside a cube the router mesh is generated automatically. PE corner
 positions are fixed by convention; the relay-column algorithm
 inserts additional grid columns whenever the gap between adjacent PE
 columns would exceed a tunable maximum. HBM occupies a central
 exclusion zone — router slots inside the zone are deliberately empty,
 since HBM controllers attach as separate named nodes. M_CPU and SRAM
 attach to the nearest router by Euclidean distance from their
 configured placement coordinates, and UCIe physical lanes distribute
 along the boundary rows and columns. The whole mesh is cached
 beside the topology spec and invalidated only when one of a small set
 of layout-relevant fields changes.
 <!-- DIAGRAM: One cube's router mesh — rows × cols of routers with HBM exclusion zone in the middle, PEs/M_CPU/SRAM attaching to nearest routers, UCIe PHYs along the perimeter. -->
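
The three builder steps can be sketched in a few lines. This is an illustration under assumptions: relay columns are spaced evenly inside an oversized gap, the exclusion zone is an inclusive rectangle of grid slots, and all function names and router ids are hypothetical.

```python
import math


def relay_columns(pe_cols, max_gap):
    """Insert evenly spaced columns wherever adjacent PE columns exceed max_gap."""
    cols = [pe_cols[0]]
    for nxt in pe_cols[1:]:
        start = cols[-1]
        gap = nxt - start
        n_extra = math.ceil(gap / max_gap) - 1
        step = gap / (n_extra + 1)
        cols.extend(start + step * i for i in range(1, n_extra + 1))
        cols.append(nxt)
    return cols


def build_router_grid(rows, cols, exclusion):
    """Router coordinates, leaving slots inside the exclusion zone empty.

    exclusion: (row_lo, row_hi, col_lo, col_hi), all inclusive.
    """
    r_lo, r_hi, c_lo, c_hi = exclusion
    return {
        f"r{r}c{c}": (float(r), float(c))
        for r in range(rows)
        for c in range(cols)
        if not (r_lo <= r <= r_hi and c_lo <= c <= c_hi)
    }


def nearest_router(routers, point):
    """Router id closest to point by Euclidean distance."""
    return min(routers, key=lambda rid: math.dist(routers[rid], point))


grid = build_router_grid(4, 4, exclusion=(1, 2, 1, 2))
m_cpu_router = nearest_router(grid, (0.2, 2.9))
```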
 ### Tensor Deployment and Allocation
 <!-- src: ADR-0008 Context, Decision -->
 Tensor deployment in the runtime API produces a list of physical-address
 shards plus a single tensor-wide virtual base. The host allocator
 walks the data-parallelism policy, computes per-shard placement, and
 emits the per-shard physical addresses through the per-PE allocators.
 No separate "allocate then later attach to a device" RPC exists —
 allocation and deployment are a single operation that produces a
 deployed tensor handle.
 ### Memory Allocator Algorithms
 <!-- src: ADR-0048 Context, Decision -->
 Each per-PE allocator owns two channels — HBM slice and TCM — each
 backed by an offset-keyed free-list. Allocation is first-fit; freeing
 coalesces with adjacent free blocks. A device-wide virtual allocator
 sits above the per-PE allocators, aligns requests up to the configured
 page size, and coalesces on free in the same way. The trade-off is
 explicit: first-fit is simpler and cheaper than best-fit or buddy
 allocation, and the simulator's workload is stack-like enough
 (deploy / kernel / free in matched order) that fragmentation is not
 a practical concern.
 <!-- src: ADR-0048 Decision, Consequences -->
 Allocation failure raises rather than silently returning a partial
 result. A partial tensor reaching the engine would route over wrong
 PAs and silently corrupt simulator output, so an out-of-memory signal
 is preferred. The free path trusts its caller to pass back exactly
 what was allocated; the small risk of caller error in exchange for
 fast common-case freeing is documented as a deliberate trade.
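
A first-fit free-list with coalescing fits in a few dozen lines. The sketch below assumes a dict keyed by offset; the class, method, and exception names are hypothetical rather than the simulator's own.

```python
class OutOfMemory(Exception):
    """Hypothetical out-of-memory signal."""


def align_up(nbytes, page):
    """Round nbytes up to a multiple of page."""
    return (nbytes + page - 1) // page * page


class FirstFitAllocator:
    def __init__(self, size):
        self.free_list = {0: size}  # offset -> length of free block

    def alloc(self, nbytes):
        # First fit: take the lowest-offset block that is large enough.
        for off in sorted(self.free_list):
            length = self.free_list[off]
            if length >= nbytes:
                del self.free_list[off]
                if length > nbytes:
                    self.free_list[off + nbytes] = length - nbytes
                return off
        raise OutOfMemory(f"cannot place {nbytes} bytes")

    def free(self, off, nbytes):
        # The caller is trusted to pass back exactly what was allocated.
        nxt = off + nbytes
        if nxt in self.free_list:  # merge with the following free block
            nbytes += self.free_list.pop(nxt)
        for prev_off, prev_len in self.free_list.items():
            if prev_off + prev_len == off:  # merge with the preceding block
                off, nbytes = prev_off, prev_len + nbytes
                break
        self.free_list[off] = nbytes


alloc = FirstFitAllocator(1024)
a = alloc.alloc(256)
b = alloc.alloc(256)
alloc.free(a, 256)
alloc.free(b, 256)
```

After both blocks are freed, the free-list collapses back to a single block covering the whole region.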
 ### Kernel Execution and Host-Device Messaging
 <!-- src: ADR-0009 Context, Decision -->
 Kernel execution decomposes into a small set of messages that travel
 the device graph. The host issues a single kernel-launch message; the
 IO_CPU fans it out per-cube; the cube M_CPU fans it out per-PE; the
 PE CPU resolves the kernel and runs it through the scheduler.
 Completion flows back the same way, gated by per-shard completion
 tracking. Memory operations follow the same pattern: a memory write
 or read travels as one message that the engine routes to the right
 HBM controller, with a response taking the reverse path.
 <!-- src: ADR-0012 Context, Decision -->
 The schema between the host and the device-side IO CPU is PA-first
 and shard-tagged. Every byte of host-issued payload arrives with an
 explicit target SIP, cube, PE, and physical address. The IO_CPU does
 not decode addresses to derive placement — placement is named
 explicitly by the shard list. This makes the host-device interface
 deterministic and keeps the routing helper free of host-derived
 intent.
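
One way to picture a PA-first, shard-tagged payload is sketched below. The `Shard` fields follow the prose (SIP, cube, PE, physical address); the field names, the `nbytes` field, and the `fan_out_by_cube` helper are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    sip: int
    cube: int
    pe: int
    pa: int      # physical address inside the target PE's memory
    nbytes: int


def fan_out_by_cube(shards):
    """Group shards per (sip, cube) using only the explicit tags.

    No address decoding happens here: placement comes from the shard list.
    """
    per_cube = defaultdict(list)
    for shard in shards:
        per_cube[(shard.sip, shard.cube)].append(shard)
    return dict(per_cube)


shards = [
    Shard(sip=0, cube=0, pe=0, pa=0x1000, nbytes=256),
    Shard(sip=0, cube=0, pe=1, pa=0x1000, nbytes=256),
    Shard(sip=0, cube=1, pe=0, pa=0x2000, nbytes=256),
]
groups = fan_out_by_cube(shards)
```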
 ### CLI Surface and Semantics
 <!-- src: ADR-0010 Context, Decision -->
 The command-line interface exposes four subcommands. A bench runner
 loads a topology, resolves a registered benchmark by name or index,
 and runs it on a selected device. A bench-listing command enumerates
 the registered benchmarks. A probe utility runs a fixed catalog of
 traffic patterns through the engine for latency and bandwidth
 verification. A web viewer renders the topology in a browser. A
 benchmark instance is always single-device by convention; multi-SIP
 collective work happens inside the benchmark through the launcher
 abstraction, not by multiplexing the CLI.
 ### Component Port and Wire Fabric Model
 <!-- src: ADR-0015 Context, Decision -->
 Every modeled component exposes input and output ports, and every
 edge in the topology connects an output port on one component to an
 input port on another. Bandwidth and propagation delay are properties
 of the wire between ports, not of the component endpoints. A
 component's responsibility is to apply its configured per-node
 overhead and either forward to the next hop or terminate; the wire
 charges the byte-over-bandwidth serialization separately.
 <!-- src: ADR-0015 Decision, Consequences -->
 This separation lets components be swapped behind their port
 interface without changing the rest of the model, and it keeps
 bandwidth contention at the wire level where multiple components may
 contend for the same edge. Future component models can refine
 internal behavior without disturbing the fabric.
 ### Two-Pass Data Execution
 <!-- src: ADR-0020 Context, Decision -->
 The simulator runs in two passes. The first pass, fast and always
 on, runs the discrete-event engine and records every data operation
 in an operation log with timestamps, component identifiers, and
 per-operation parameters. The second pass, optional and opt-in,
 replays the log against an in-memory tensor store to produce actual
 numerical results. Tests that only need timing skip the second pass;
 tests that need to verify correctness opt in.
 <!-- src: ADR-0020 Decision, Consequences -->
 The split lets the timing engine remain unconcerned with data
 semantics: kernels move handles around, not bytes. The replay phase
 recovers data semantics from the recorded operations, in their
 original time order with a small set of secondary-sort rules. The
 op-log records carry enough metadata — input snapshots for compute
 operations, source snapshots for cross-component copies — that the
 replay phase cannot mis-order with respect to in-flight mutations.
 ### Sim-engine Op Log and Memory Store Schemas
 <!-- src: ADR-0052 Context, Decision -->
 The operation log holds typed records with seven fields each: start
 and end timestamps, the component that issued the operation, an
 operation kind ("memory", "gemm", "math"), an operation name, a
 parameter dictionary, and a (currently unused) dependency list.
 Records are kept in stable timestamp order. The parameter dictionary
 varies by operation: a DMA read carries source address and byte count;
 a GEMM carries operand shapes, dtypes, and address spaces; a math
 operation carries input addresses and snapshots.
 <!-- src: ADR-0052 Decision, Consequences -->
 The companion memory store is a two-level dictionary keyed by
 address space ("hbm", "tcm", "sram", others) and integer address.
 Reads and writes are reference-based — no copy by default — so
 callers wanting to detach a snapshot must copy explicitly. This is
 deliberate: the engine-internal snapshot paths copy at well-defined
 points (math input capture, HBM source capture for DMA writes,
 inbound collective copies) and downstream replay code therefore
 sees stable data even when slot or scratch addresses are reused by
 later operations.
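
The two schemas can be sketched as a dataclass and a nested dictionary. The names `t_start`, `t_end`, `component_id`, and `params` match attributes the sweep script reads from op-log records; `op_kind`, `op_name`, `deps`, and the `MemoryStore` class are assumed names for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OpRecord:
    t_start: float
    t_end: float
    component_id: str
    op_kind: str          # "memory", "gemm", or "math"
    op_name: str
    params: dict = field(default_factory=dict)
    deps: list = field(default_factory=list)  # currently unused


class MemoryStore:
    """Two-level dictionary: address space -> integer address -> value."""

    def __init__(self):
        self._spaces = {}

    def write(self, space, addr, value):
        self._spaces.setdefault(space, {})[addr] = value  # stores a reference

    def read(self, space, addr):
        return self._spaces[space][addr]  # returns the same reference


store = MemoryStore()
buf = [1, 2, 3]
store.write("hbm", 0x100, buf)
snapshot = list(store.read("hbm", 0x100))  # explicit copy detaches the data
buf[0] = 99                                # later mutation of the source

log = [
    OpRecord(20.0, 30.0, "sip0.cube0.pe0.pe_gemm", "gemm", "dot"),
    OpRecord(0.0, 10.0, "sip0.cube0.pe0.pe_dma", "memory", "dma_read"),
]
replay_order = sorted(log, key=lambda r: r.t_start)  # sorted() is stable
```

Because reads are reference-based, the stored value reflects the later mutation while the explicit snapshot does not.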
 ### 2D Grid Program Identity
 <!-- src: ADR-0022 Context, Decision -->
 Inside a kernel the program identity is two-dimensional. The
 first axis corresponds to the PE index within a cube; the second
 corresponds to the cube index within a SIP. Together they let a
 kernel address its position both within its cube and within the
 larger system without needing to know the full topology. Total
 program counts along each axis are exposed symmetrically.
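
A small sketch of how a kernel might use the two axes. `tl.program_id(axis=...)` is the call the mesh attention kernels in this change use; `FakeTL` and `ring_rank` are hypothetical stand-ins written only for this example.

```python
class FakeTL:
    """Stand-in for the kernel-side `tl` object (hypothetical)."""

    def __init__(self, pe_id, cube_id):
        self._ids = (pe_id, cube_id)

    def program_id(self, axis):
        # axis 0: PE index within the cube; axis 1: cube index within the SIP
        return self._ids[axis]


def ring_rank(tl, rank_axis=0):
    """Return this program's ring rank, or None if it should stay silent.

    With rank_axis=1 the ring runs across cubes, so only PE 0 of each
    cube participates.
    """
    if rank_axis != 0 and tl.program_id(axis=0) != 0:
        return None
    return tl.program_id(axis=rank_axis)


pe_ring = ring_rank(FakeTL(pe_id=3, cube_id=5))
silent = ring_rank(FakeTL(pe_id=3, cube_id=5), rank_axis=1)
cube_ring = ring_rank(FakeTL(pe_id=0, cube_id=5), rank_axis=1)
```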
 ### Parallelism — SIP Launcher, DPPolicy, Megatron-TP, AHBM Backend, and CCL Algorithm Module
 <!-- src: ADR-0024 Context, Decision -->
 The launcher model treats each SIP as one rank. Inside a process the
 launcher spawns one greenlet per SIP rank; the rank is bound to its
 greenlet so that any code running in that worker sees the right
 distributed-style rank. This is a deliberately PyTorch-compatible
 shape: a benchmark looks like a small DDP training script — initialize
 a process group, spawn workers, each worker runs the same body.
 <!-- src: ADR-0026 Context, Decision -->
 Data-parallelism policy lives in a single object that names the
 sharding strategy along the cube axis (replicate, row-wise,
 column-wise) and along the PE axis (same set of values), and optionally
 overrides the number of cubes or PEs participating. The policy is
 intra-device — it does not cross SIP boundaries. SIP-level parallelism
 is the launcher's responsibility, and the two axes compose
 orthogonally.
 <!-- src: ADR-0027 Context, Decision -->
 A Megatron-style tensor-parallel API sits on top of the launcher and
 the DP policy. Layer-level building blocks — column-parallel linear,
 row-parallel linear, all-reduce — name their sharding intent in terms
 the launcher and the placement policy can compose. This is the layer
 that bench code typically writes against.
 <!-- src: ADR-0047 Context, Decision -->
 For collective operations the runtime exposes a PyTorch-compatible
 distributed backend named "ahbm". On process-group initialization the
 backend loads the configured collective-algorithm module, resolves
 the world size (priority: explicit ccl.yaml override → defaults
 section → topology SIP count), imports the algorithm module
 dynamically, derives the SIP topology kind, and pushes the inter-PE
 neighbor table to every participating PE. From that point on, an
 all-reduce call dispatches the algorithm's kernel function across
 all ranks.
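
The world-size priority chain can be sketched as a simple fallthrough. The key names `world_size` and `defaults` are assumptions about the ccl.yaml layout, used here only to show the order of precedence.

```python
def resolve_world_size(ccl_cfg, topology_sip_count):
    """Explicit override, then the defaults section, then the topology SIP count."""
    override = ccl_cfg.get("world_size")
    if override is not None:
        return override
    default = ccl_cfg.get("defaults", {}).get("world_size")
    if default is not None:
        return default
    return topology_sip_count


from_override = resolve_world_size({"world_size": 4, "defaults": {"world_size": 2}}, 6)
from_defaults = resolve_world_size({"defaults": {"world_size": 2}}, 6)
from_topology = resolve_world_size({}, 6)
```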
 <!-- src: ADR-0050 Context, Decision -->
 A collective-algorithm module is a Python module with a small, fixed
 contract. It exposes topology-kind integer constants, a name-to-kind
 mapping for the YAML configuration, a kernel-arguments builder, and
 a kernel function — the kernel function being aliased to the name
 `kernel` so the backend can find it generically. The kernel itself
 takes the tensor pointer, the per-cube element count, cube mesh
 width and height, the world size, the current rank, and the SIP
 topology dimensions; the backend appends those last four arguments
 automatically. New collectives slot in by adding a new module that
 follows this shape.
 <!-- src: ADR-0027 Decision, Consequences -->
 The combination is deliberate: bench authors get to write code that
 looks like a regular distributed training script, while the launcher,
 backend, and placement policies behind it remain free to redirect
 work to the right SIP, cube, and PE without exposing topology to the
 kernel.
 ### IPCQ Direction Addressing
 <!-- src: ADR-0025 Context, Decision -->
 Inside a collective algorithm, peer PEs are named by direction:
 "N", "S", "E", "W" for cube-internal neighbors, and "global_*" for
 cross-SIP neighbors. Direction is the entire addressing scheme:
 the algorithm names a direction, the IPCQ neighbor table installed
 at process-group time resolves the direction to the peer endpoint's
 physical-address coordinates, and the PE_DMA performs the actual
 transfer. The algorithm itself never does PA arithmetic; direction
 is the user-facing handle.
 ### Intercube All-Reduce
 <!-- src: ADR-0032 Context, Decision -->
 The default all-reduce algorithm uses a center-rooted bidirectional
 phase inside each SIP's cube mesh followed by an inter-SIP exchange
 on the mesh's root cube, and then a bidirectional broadcast back
 out. Center-rooting halves the in-cube hop count compared with a
 corner-rooted walk. The inter-SIP exchange itself follows the
 configured SIP topology — ring, torus, or non-wrapping mesh —
 selected at runtime through the SIP-topology kind integer the
 backend passes to the kernel.
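
The hop-count argument can be checked with a little arithmetic. The sketch assumes hop count means the worst-case Manhattan distance from the root to any cube in the mesh, which is an interpretation rather than something the ADR states; `max_hops` is a hypothetical helper.

```python
def max_hops(width, height, root):
    """Worst-case Manhattan distance from root to any cube in the mesh."""
    rx, ry = root
    return max(
        abs(x - rx) + abs(y - ry)
        for x in range(width)
        for y in range(height)
    )


corner_5x5 = max_hops(5, 5, root=(0, 0))
center_5x5 = max_hops(5, 5, root=(2, 2))
corner_4x4 = max_hops(4, 4, root=(0, 0))
center_4x4 = max_hops(4, 4, root=(2, 2))
```

Under this interpretation a 5x5 mesh drops from 8 hops to 4, an exact halving, while a 4x4 mesh drops from 6 to 4, since an even dimension has no true center cube.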
 ### Evaluation Harnesses
 <!-- src: ADR-0043 Context, Decision -->
 The all-reduce evaluation harness drives correctness and the
 latency/buffer-kind sweeps through the public distributed path —
 initialize process group, spawn workers, call all-reduce — rather
 than the lower-level engine interface. A shared helper module factors
 out the setup; sweep tests cover the buffer-kind tiers (TCM, SRAM,
 HBM) and the inter-SIP topology variants. The plots produced by the
 harness are part of its output contract; the harness regenerates them
 on demand.
 <!-- src: ADR-0044 Context, Decision -->
 The GEMM evaluation harness is split into two layers. A heavy
 shape-and-variant sweep lives as a manual script — it runs the same
 composite-GEMM benchmark across many shapes and operand-staging
 variants, harvests the resulting op-log, and writes a JSON summary.
 A faster figure-generation layer lives in the test suite and consumes
 that JSON to render plots. The split keeps the heavy data
 generation explicit and out of the regular test path.
 ### Bench Module Contract
 <!-- src: ADR-0045 Context, Decision -->
 Adding a new benchmark requires only dropping a file into the
 benchmarks directory. The file registers one or more benchmark
 functions through a small decorator that takes a kebab-case name and
 a human-readable description. The decorator is the registration
 mechanism — there is no separate manifest. Each benchmark function
 takes one argument, conventionally named `torch`, which is the
 runtime context exposing tensor allocation, kernel launch,
 distributed APIs, and process-spawning. The function name is `run` by
 convention.
 <!-- src: ADR-0045 Decision, Consequences -->
 A benchmark must submit at least one operation, or the runner
 returns an error. A benchmark instance is single-device by default;
 when a benchmark is collective, it uses the distributed-process-spawn
 pattern internally — one worker greenlet per rank, with each worker
 binding to its rank. Multi-device benchmark patterns outside that
 shape are not supported.
 ### Kernel-side `tl.*` API
 <!-- src: ADR-0046 Context, Decision -->
 Inside a kernel function, the `tl` argument exposes the kernel-side
 API in a shape that mirrors the conventions of established
 GPU-kernel languages. Categories: reference handles that name HBM
 data without issuing DMA; data movement (load, store) that does
 issue DMA; GEMM and math compute (dot, composite, the unary and
 binary math operations, reductions); index and scalar helpers
 (program identity, range-builders); metadata-only operations like
 transpose; and the collective primitives (send, receive,
 non-blocking receive). Tensor handles support arithmetic operators
 via a thread-local active context so kernel code reads naturally.
 <!-- src: ADR-0046 Decision, Consequences -->
 The API supports two execution modes. A command-list mode records
 operations into a list without consuming simulator time — useful for
 inspection and lightweight tests. A greenlet-driven mode runs the
 kernel as a child greenlet that switches back to the simulator on
 each `tl.*` call; the simulator drives the event scheduler and hands
 real data back to the kernel as DMA reads complete. The two modes
 share the same surface; the kernel does not know which one it is
 running under.
 ### Probe Subcommand
 <!-- src: ADR-0049 Context, Decision -->
 The probe utility runs three families of traffic patterns through
 the engine — host-to-device writes at increasing hop counts,
 device-to-host reads at increasing hop counts, and PE-initiated DMA
 across the cube mesh — and reports actual latency, the analytical
 formula breakdown, effective bandwidth, bottleneck bandwidth, and
 utilization. A fixed reference size is used for the summary table;
 a separate utilization-versus-size sweep covers a logarithmic range
 of transfer sizes. Each case runs in its own engine instance so
 cases do not perturb each other.
 <!-- src: ADR-0049 Decision, Consequences -->
 The probe also checks a small set of invariants automatically:
 monotonic latency increase with hop count, device-to-host latency
 at least as large as host-to-device for the same hop count, and a
 faster best-case path than worst-case for cross-cube PE DMA. Failures
 print prominently. The output is meant for human reading; automated
 parsing should not depend on column widths or whitespace.
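
The invariant checks can be sketched as one function over the measured latencies. The function name, argument layout, and the choice to treat equal latencies at adjacent hop counts as acceptable are assumptions for the example.

```python
def check_invariants(h2d_ns, d2h_ns, pe_best_ns, pe_worst_ns):
    """Return human-readable failures; an empty list means all checks passed.

    h2d_ns / d2h_ns: latencies indexed by hop count.
    """
    failures = []
    for label, series in (("h2d", h2d_ns), ("d2h", d2h_ns)):
        for hop, (a, b) in enumerate(zip(series, series[1:])):
            if b < a:
                failures.append(f"{label}: latency drops from hop {hop} to {hop + 1}")
    for hop, (write_ns, read_ns) in enumerate(zip(h2d_ns, d2h_ns)):
        if read_ns < write_ns:
            failures.append(f"d2h faster than h2d at hop {hop}")
    if not pe_best_ns < pe_worst_ns:
        failures.append("best-case PE DMA is not faster than worst-case")
    return failures


ok = check_invariants([100, 150, 220], [110, 160, 230], 300, 500)
bad = check_invariants([100, 90], [110, 120], 300, 300)
```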
 ---
 This document summarizes 46 architecture decisions captured during
 the first half of 2026. It is regenerated mechanically from the
 decision corpus; sources are recorded in HTML comments throughout.
@@ -1,237 +1,20 @@
 """Sweep GEMM shapes through kernbench and dump PE_accelerator engine times.
-For each shape:
-  - run benches.matmul_composite via the same run_bench path the CLI uses
-  - read result.engine.op_log
-  - filter to per-PE engines: pe_dma, pe_fetch_store, pe_gemm, pe_math
-  - record sum-of-durations (engine occupancy) AND wall-clock active interval
-Output: docs/diagrams/gemm_sweep.json
+Thin wrapper: the sweep logic now lives in
+``kernbench.benches.milestone_1h_gemm`` (the single home, ADR-0054, also the
+``milestone-1h-gemm`` bench). This script remains the manual entry point for
+regenerating ``docs/diagrams/gemm_sweep.json`` on demand and honors the same
+``SWEEP_SHAPES`` / ``SWEEP_TOPOLOGY`` env overrides.
+    python scripts/gemm_sweep.py
 """
 from __future__ import annotations
-import json
+from kernbench.benches.milestone_1h_gemm import run_sweep
 import os
 import sys
 import time
 from pathlib import Path
 # Default sweep covering under-tile, single-tile, multi-tile, and asymmetric regimes.
 # Each entry is either a single integer (square M=K=N=S) or "MxKxN".
 # Override via env: SWEEP_SHAPES="16,32,16x2048x16,..."
 DEFAULT_SHAPES = [
    "32x32x32",       # 1 tile, K=32 < TILE_K=64 → under-tile in K
    "32x64x32",       # 1 tile, exact single-tile fit
    "32x128x32",      # 2 tiles, aligned
    "32x128x128",     # 8 tiles, aligned
    "32x3072x32",     # 48 tiles, all K-axis (tall-skinny)
    "8x128x128",      # 8 tiles, but M=8 < TILE_M=32 → MAC array under-fed
    "128x8x128",      # 16 tiles, but K=8 < TILE_K=64 → MAC array under-fed
    "512",            # 2048 tiles, fully aligned — "well-pipelined" reference
 ]
 # Operand-staging variants exercised per shape.
 VARIANTS = ["ref_ref", "load_ref", "load_load"]
 # Engines whose timings we collect (component_id suffix match).
 ENGINES = ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_math"]
 # Per-stage breakdown labels (StageType enum names from pe_types.py).
 STAGES = ["DMA_READ", "DMA_WRITE", "FETCH", "STORE", "GEMM", "MATH"]
 # Scheduler tile sizes (mirror of PeSchedulerComponent.TILE_M/K/N).
 TILE_M, TILE_K, TILE_N = 32, 64, 32
 OUT_PATH = Path(__file__).parent.parent / "docs" / "diagrams" / "gemm_sweep.json"
 def _engine_wall_ns(records, suffix: str) -> float:
    """Wall-clock interval the engine was active (union of overlapping ops)."""
    intervals = [(r.t_start, r.t_end) for r in records
                 if r.component_id.endswith("." + suffix)]
    if not intervals:
        return 0.0
    intervals.sort()
    merged_end = intervals[0][1]
    merged_start = intervals[0][0]
    total = 0.0
    for s, e in intervals[1:]:
        if s <= merged_end:
            merged_end = max(merged_end, e)
        else:
            total += merged_end - merged_start
            merged_start, merged_end = s, e
    total += merged_end - merged_start
    return total
 def _engine_occupancy_ns(records, suffix: str) -> float:
    return sum(r.t_end - r.t_start for r in records
               if r.component_id.endswith("." + suffix))
 def _engine_count(records, suffix: str) -> int:
    return sum(1 for r in records if r.component_id.endswith("." + suffix))
 def _stage_occupancy_ns(records, stage_type: str) -> float:
    """Sum t_end - t_start over op_log records whose params.stage_type matches.
    Requires op_log records produced post the TileToken stage_type capture
    (sim_engine/op_log.py).
    """
    return sum(
        r.t_end - r.t_start
        for r in records
        if r.params.get("stage_type") == stage_type
    )
 def _stage_wall_ns(records, stage_type: str) -> float:
    """Interval-union wall-clock for records whose stage_type matches."""
    intervals = sorted(
        (r.t_start, r.t_end) for r in records
        if r.params.get("stage_type") == stage_type
    )
    if not intervals:
        return 0.0
    total = 0.0
    cs, ce = intervals[0]
    for s, e in intervals[1:]:
        if s <= ce:
            ce = max(ce, e)
        else:
            total += ce - cs
            cs, ce = s, e
    total += ce - cs
    return total
 def _stage_count(records, stage_type: str) -> int:
    return sum(1 for r in records if r.params.get("stage_type") == stage_type)
 def _run_one(M: int, K: int, N: int, topology: str, variant: str = "ref_ref") -> dict:
    os.environ["MATMUL_M"] = str(M)
    os.environ["MATMUL_K"] = str(K)
    os.environ["MATMUL_N"] = str(N)
    os.environ["MATMUL_VARIANT"] = variant
    # Late imports so env vars are read by matmul_composite at module load.
    # Force re-import to pick up new env values.
    for mod_name in [m for m in list(sys.modules) if m.startswith("kernbench.benches.matmul_composite")]:
        del sys.modules[mod_name]
    from kernbench.benches.registry import resolve as resolve_bench
    from kernbench.runtime_api.bench_runner import run_bench
    from kernbench.runtime_api.types import resolve_device
    from kernbench.sim_engine.engine import GraphEngine
    from kernbench.topology.builder import resolve_topology
    topo = resolve_topology(topology)
    bench = resolve_bench("matmul-composite").run
    device = resolve_device(None)
    t0 = time.time()
    result = run_bench(
        topology=topo, bench_fn=bench, device=device,
        engine_factory=lambda t, d: GraphEngine(
            getattr(t, "topology_obj", t), enable_data=True,
        ),
    )
    wall = time.time() - t0
    op_log = result.engine.op_log
    if not result.completion.ok:
        raise RuntimeError(f"bench failed at M={M},K={K},N={N}: {result.completion}")
    # Bytes touched at f16 (2 B): full A + full B + full out (each operand
    # streamed once through HBM by the composite plan).
    bytes_total = (M * K + K * N + M * N) * 2
    row = {
        "M": M, "K": K, "N": N,
        "variant": variant,
        "flops": 2 * M * K * N,
        "bytes_hbm": bytes_total,
        "arith_intensity": (2 * M * K * N) / bytes_total,  # flops/byte
        "tile_count_expected": _ceil(M, TILE_M) * _ceil(N, TILE_N) * _ceil(K, TILE_K),
        "sim_wall_clock_s": round(wall, 3),
        "engines": {},
    }
    for eng in ENGINES:
        row["engines"][eng] = {
            "occupancy_ns": _engine_occupancy_ns(op_log, eng),
            "wall_ns":      _engine_wall_ns(op_log, eng),
            "record_count": _engine_count(op_log, eng),
        }
    row["stages"] = {}
    for stage in STAGES:
        row["stages"][stage] = {
            "occupancy_ns": _stage_occupancy_ns(op_log, stage),
            "wall_ns":      _stage_wall_ns(op_log, stage),
            "record_count": _stage_count(op_log, stage),
        }
    # Kernel-window wall-clock = max t_end - min t_start over PE engine records.
    pe_records = [r for r in op_log
                  if any(r.component_id.endswith("." + e) for e in ENGINES)]
    if pe_records:
        row["pe_window_ns"] = max(r.t_end for r in pe_records) \
                              - min(r.t_start for r in pe_records)
    else:
        row["pe_window_ns"] = 0.0
    stage_records = [r for r in op_log
                     if r.params.get("stage_type") in STAGES]
    if stage_records:
        row["composite_window_ns"] = max(r.t_end for r in stage_records) \
                                     - min(r.t_start for r in stage_records)
    else:
        row["composite_window_ns"] = 0.0
    return row
 def _ceil(a: int, b: int) -> int:
    return (a + b - 1) // b
 def main() -> int:
-    shapes_env = os.environ.get("SWEEP_SHAPES")
+    run_sweep()
    raw = (shapes_env.split(",") if shapes_env else DEFAULT_SHAPES)
    shapes: list[tuple[int, int, int]] = []
    for s in raw:
        s = s.strip()
        if not s:
            continue
        if "x" in s.lower():
            parts = s.lower().split("x")
            shapes.append((int(parts[0]), int(parts[1]), int(parts[2])))
        else:
            v = int(s)
            shapes.append((v, v, v))
    topology = os.environ.get("SWEEP_TOPOLOGY", "topology.yaml")
    rows = []
    for M, K, N in shapes:
        for variant in VARIANTS:
            print(f"[sweep] M={M} K={K} N={N} variant={variant} ...", flush=True)
            row = _run_one(M, K, N, topology, variant=variant)
            rows.append(row)
            eng_dma = row["engines"]["pe_dma"]
            eng_gem = row["engines"]["pe_gemm"]
            print(f"   tiles={row['tile_count_expected']:>6}  "
                  f"pe_window={row['pe_window_ns']:8.1f}ns  "
                  f"dma_occ={eng_dma['occupancy_ns']:9.1f}  "
                  f"gemm_occ={eng_gem['occupancy_ns']:8.1f}  "
                  f"(sim {row['sim_wall_clock_s']:.1f}s)")
    OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
    OUT_PATH.write_text(json.dumps({
        "tile_sizes": {"M": TILE_M, "K": TILE_K, "N": TILE_N},
        "engines": ENGINES,
        "stages": STAGES,
        "variants": VARIANTS,
        "rows": rows,
    }, indent=2))
    print(f"\n[sweep] wrote {OUT_PATH}")
    return 0
@@ -0,0 +1,13 @@
 buffer_kind,sip_topology,n_sips,n_elem,bytes_per_pe,latency_ns
 hbm,torus_2d,6,128,256,2120.040000000012
 hbm,torus_2d,6,1024,2048,2717.2783333333473
 hbm,torus_2d,6,8192,16384,7315.184999999989
 hbm,torus_2d,6,32768,65536,23081.26500000037
 sram,torus_2d,6,128,256,2060.040000000012
 sram,torus_2d,6,1024,2048,2909.2783333333473
 sram,torus_2d,6,8192,16384,9523.184999999869
 sram,torus_2d,6,32768,65536,32201.265000000385
 tcm,torus_2d,6,128,256,1964.040000000012
 tcm,torus_2d,6,1024,2048,2477.2783333333473
 tcm,torus_2d,6,8192,16384,6403.185000000109
 tcm,torus_2d,6,32768,65536,19865.265000000378
@@ -0,0 +1,37 @@
 algorithm,sip_topology,n_sips,n_elem,bytes_per_pe,bytes_per_sip,latency_ns
 lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,8,16,256,2666.552500000015
 lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,32,64,1024,2747.7400000000152
 lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,64,128,2048,2855.990000000018
 lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,128,256,4096,3072.490000000019
 lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,512,1024,16384,3337.1133333333582
 lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,1024,2048,32768,3708.0333333333692
 lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,2048,4096,65536,4449.873333333393
 lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,4096,8192,131072,5933.020000000124
 lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,8192,16384,262144,8900.379999999863
 lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,16384,32768,524288,14835.099999999224
 lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,32768,65536,1048576,26704.540000000765
 lrab_hierarchical_allreduce,mesh_2d_no_wrap,6,49152,98304,1572864,38573.97999999701
 lrab_hierarchical_allreduce,ring_1d,6,8,16,256,2365.255833333347
 lrab_hierarchical_allreduce,ring_1d,6,32,64,1024,2436.9433333333473
 lrab_hierarchical_allreduce,ring_1d,6,64,128,2048,2532.526666666683
 lrab_hierarchical_allreduce,ring_1d,6,128,256,4096,2723.693333333349
 lrab_hierarchical_allreduce,ring_1d,6,512,1024,16384,3048.635000000021
 lrab_hierarchical_allreduce,ring_1d,6,1024,2048,32768,3393.4016666666957
 lrab_hierarchical_allreduce,ring_1d,6,2048,4096,65536,4082.401666666714
 lrab_hierarchical_allreduce,ring_1d,6,4096,8192,131072,5458.80166666677
 lrab_hierarchical_allreduce,ring_1d,6,8192,16384,262144,8216.934999999943
 lrab_hierarchical_allreduce,ring_1d,6,16384,32768,524288,13733.201666665835
 lrab_hierarchical_allreduce,ring_1d,6,32768,65536,1048576,24765.73500000064
 lrab_hierarchical_allreduce,ring_1d,6,49152,98304,1572864,35798.268333331536
 lrab_hierarchical_allreduce,torus_2d,6,8,16,256,1700.6025000000095
 lrab_hierarchical_allreduce,torus_2d,6,32,64,1024,1753.2900000000102
 lrab_hierarchical_allreduce,torus_2d,6,64,128,2048,1823.540000000012
 lrab_hierarchical_allreduce,torus_2d,6,128,256,4096,1964.040000000012
 lrab_hierarchical_allreduce,torus_2d,6,512,1024,16384,2196.8183333333463
 lrab_hierarchical_allreduce,torus_2d,6,1024,2048,32768,2477.2783333333473
 lrab_hierarchical_allreduce,torus_2d,6,2048,4096,65536,3038.1983333333583
 lrab_hierarchical_allreduce,torus_2d,6,4096,8192,131072,4159.5050000000665
 lrab_hierarchical_allreduce,torus_2d,6,8192,16384,262144,6403.185000000109
 lrab_hierarchical_allreduce,torus_2d,6,16384,32768,524288,10890.5449999995
 lrab_hierarchical_allreduce,torus_2d,6,32768,65536,1048576,19865.265000000378
 lrab_hierarchical_allreduce,torus_2d,6,49152,98304,1572864,28839.98500000059
@@ -0,0 +1,180 @@
 """Mesh-native bidirectional Ring-K/V attention kernel — prefill (ADR-0059 Proposed).
 Each rank holds its own Q tile and 1/n_ranks of K, V (sequence-sharded).
 Over ``n_ranks - 1`` bidirectional steps, K and V propagate both east and
 west: chunk c_i originating at rank i reaches rank j at step ``|i - j|``.
 Every rank receives every other rank's chunk **exactly once** and folds it
 into a running ``(m, ℓ, o)`` via the online-softmax recurrence. After all
 steps each rank holds the final attention output for its own Q tokens —
 no cross-rank merge is required.
 Supersedes ADR-0055's closed-ring ``_attention_ring_kv.py``. Both modules
 stay on disk during the transition; this one runs on the hardware's
 actual open-mesh wiring (no closed-ring SFR install required).
 Imported by ``milestone_gqa_llama70b`` (after the bench's Phase 2 switches
 its imports) and invoked through ``torch.launch(...)`` — not through
 ``dist.all_reduce(...)``. See ADR-0055 Context for why this kernel is not
 backend-dispatched via ADR-0050's algorithm-module contract.
 """
 from __future__ import annotations
 from kernbench.common.pe_commands import TensorHandle
 def _view(handle: TensorHandle, new_shape: tuple[int, ...]) -> TensorHandle:
    """Reshape — metadata only, no command emitted (cf. ``tl.trans``)."""
    return TensorHandle(
        id=handle.id,
        addr=handle.addr,
        shape=new_shape,
        dtype=handle.dtype,
        nbytes=handle.nbytes,
        data=handle.data,
        space=handle.space,
        pinned=handle.pinned,
    )
 def _partial_attention(
    Q: TensorHandle,
    K: TensorHandle,
    V: TensorHandle,
    S_q: int,
    S_kv_per_rank: int,
    h_q: int,
    d_head: int,
    tl,
 ) -> tuple[TensorHandle, TensorHandle, TensorHandle]:
    """One pass of partial attention against (K, V).
    Emits 1 GEMM(Q·K^T) + softmax + max + sub + exp + sum + 1 GEMM(P·V).
    Returns the running-statistics triplet ``(m, ℓ, O_partial)`` for the
    online-softmax mlo merge.
    """
    K_2d_T = _view(K, (h_q * d_head, S_kv_per_rank))
    V_2d = _view(V, (S_kv_per_rank, h_q * d_head))
    scores = tl.dot(Q, K_2d_T)
    m = tl.max(scores, axis=-1)
    P = tl.softmax(scores, axis=-1)
    scores_centered = scores - m
    exp_scores = tl.exp(scores_centered)
    ell = tl.sum(exp_scores, axis=-1)
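    # NOTE: ``P`` is row-normalized by ``tl.softmax``, whereas the mlo merge
    # and the final ``o / ℓ`` assume an unnormalized ``exp_scores · V``.
    O_partial = tl.dot(P, V_2d)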
    return m, ell, O_partial
 def attention_mesh_kv_kernel(
    q_ptr: int,
    k_ptr: int,
    v_ptr: int,
    o_ptr: int,
    S_q: int,
    S_kv_per_rank: int,
    h_q: int,
    h_kv: int,
    d_head: int,
    n_ranks: int,
    rank_axis: int = 0,
    *,
    tl,
 ) -> None:
    """Mesh-native bidirectional Ring-K/V attention — see module docstring.
    ``rank_axis`` selects which program-id dimension carries the ring rank:
        0 — single_user_* panels: rank == tl.program_id(axis=0) (PE id in cube).
        1 — multi_user_* panels: ring is at the cube level. Only PE 0 in each
            cube participates; the other 7 hold KV replicas but stay silent.
    """
    # For multi_user (rank_axis=1) only PE 0 in each cube runs the ring.
    if rank_axis != 0 and tl.program_id(axis=0) != 0:
        return
    rank = tl.program_id(axis=rank_axis)
    has_E = rank < n_ranks - 1
    has_W = rank > 0
    # Q stays put on this rank — loaded once, used in every partial attention.
    Q = tl.load(q_ptr, shape=(S_q, h_q * d_head), dtype="f16")
    # Local K, V chunk.
    K = tl.load(k_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")
    V = tl.load(v_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")
    # Step 0 (local): partial attention against own K, V — initializes the
    # running triplet (m, ℓ, o).
    m, ell, o = _partial_attention(
        Q, K, V, S_q, S_kv_per_rank, h_q, d_head, tl,
    )
    # Seed bidirectional waves with own chunk (step-1 send).
    to_send_east_K: TensorHandle | None = K
    to_send_east_V: TensorHandle | None = V
    to_send_west_K: TensorHandle | None = K
    to_send_west_V: TensorHandle | None = V
    # Bidirectional fan-out: n_ranks - 1 steps. By step k, the wave from
    # rank i has reached rank (i ± k). After n_ranks - 1 steps, every rank
    # has merged every other rank's chunk exactly once (ADR-0059 D3).
    for step in range(1, n_ranks):
        # Send the eastbound wave we currently hold (own at step 1; forwarded
        # at later steps). ``None`` means we have no wave to forward this
        # direction this step (edge rank, or the wave already passed by).
        if has_E and to_send_east_K is not None:
            tl.send(dir="E", src=to_send_east_K)
            tl.send(dir="E", src=to_send_east_V)
        if has_W and to_send_west_K is not None:
            tl.send(dir="W", src=to_send_west_K)
            tl.send(dir="W", src=to_send_west_V)
        # Receive eastbound wave from W (carries chunk c_{rank - step}).
        K_from_W: TensorHandle | None = None
        V_from_W: TensorHandle | None = None
        if has_W and (rank - step) >= 0:
            K_from_W = tl.recv(
                dir="W", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
            )
            V_from_W = tl.recv(
                dir="W", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
            )
            m_new, ell_new, o_new = _partial_attention(
                Q, K_from_W, V_from_W, S_q, S_kv_per_rank, h_q, d_head, tl,
            )
            m_combined = tl.maximum(m, m_new)
            scale_old = tl.exp(m - m_combined)
            scale_new = tl.exp(m_new - m_combined)
            ell = ell * scale_old + ell_new * scale_new
            o = o * scale_old + o_new * scale_new
            m = m_combined
        # Receive westbound wave from E (carries chunk c_{rank + step}).
        K_from_E: TensorHandle | None = None
        V_from_E: TensorHandle | None = None
        if has_E and (rank + step) < n_ranks:
            K_from_E = tl.recv(
                dir="E", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
            )
            V_from_E = tl.recv(
                dir="E", shape=(S_kv_per_rank, h_kv, d_head), dtype="f16",
            )
            m_new, ell_new, o_new = _partial_attention(
                Q, K_from_E, V_from_E, S_q, S_kv_per_rank, h_q, d_head, tl,
            )
            m_combined = tl.maximum(m, m_new)
            scale_old = tl.exp(m - m_combined)
            scale_new = tl.exp(m_new - m_combined)
            ell = ell * scale_old + ell_new * scale_new
            o = o * scale_old + o_new * scale_new
            m = m_combined
        # Forward what we received for next step. ``None`` propagates: if no
        # chunk arrived this step (out-of-bounds wave origin), there is
        # nothing to forward next step in that direction.
        to_send_east_K = K_from_W
        to_send_east_V = V_from_W
        to_send_west_K = K_from_E
        to_send_west_V = V_from_E
    # Final normalize: O := o / ℓ.
    O_final = o / ell
    tl.store(o_ptr, O_final)
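
The bidirectional schedule described in the module docstring above can be checked outside the simulator. The following is a minimal sketch in plain Python, under stated assumptions: `simulate_fanout` is a hypothetical helper (not part of kernbench), it makes no `tl` calls, and it tracks only chunk ids. Each rank seeds both waves with its own id, forwards whatever it received on the previous step, and records the step at which every chunk arrives.

```python
from __future__ import annotations


def simulate_fanout(n_ranks: int) -> dict[int, dict[int, int]]:
    """Return arrivals[rank][origin] = step at which origin's chunk arrived.

    Hypothetical model of the open-mesh fan-out: rank r has an east
    neighbor iff r < n_ranks - 1 and a west neighbor iff r > 0.
    """
    arrivals: dict[int, dict[int, int]] = {r: {} for r in range(n_ranks)}
    # Chunk id each rank will send in each direction on the next step.
    east: list[int | None] = list(range(n_ranks))
    west: list[int | None] = list(range(n_ranks))

    for step in range(1, n_ranks):
        from_w: list[int | None] = [None] * n_ranks
        from_e: list[int | None] = [None] * n_ranks
        # Send phase: eastbound chunks land on r + 1, westbound on r - 1.
        for r in range(n_ranks):
            if r < n_ranks - 1 and east[r] is not None:
                from_w[r + 1] = east[r]
            if r > 0 and west[r] is not None:
                from_e[r - 1] = west[r]
        # Receive phase: record each arrival; a repeat would be a bug.
        for r in range(n_ranks):
            for chunk in (from_w[r], from_e[r]):
                if chunk is not None:
                    assert chunk not in arrivals[r], "chunk merged twice"
                    arrivals[r][chunk] = step
        # Forward what was received; None propagates past the mesh edges.
        east, west = from_w, from_e
    return arrivals


if __name__ == "__main__":
    for rank, seen in simulate_fanout(4).items():
        print(rank, dict(sorted(seen.items())))
```

In this model every rank ends up with each other rank's chunk exactly once, and the arrival step equals `|i - j|`, which is the property the kernel's receive guards (`rank - step >= 0`, `rank + step < n_ranks`) rely on.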
@@ -0,0 +1,151 @@
 """Mesh-native bidirectional AllReduce-mlo attention — decode (ADR-0059 Proposed).
 Every rank holds the full Q (replicated, small at ``S_q=1``) and 1/n_ranks
 of KV (sequence-sharded). Each rank computes its partial attention
 against own KV in ONE shot, then runs a bidirectional fan-out of the
 ``(m, ℓ, o)`` triplet: the triplet originating at rank i reaches rank j at
 step ``|i - j|``. Every rank merges every other rank's triplet exactly
 once over ``n_ranks - 1`` steps, ending with the final answer replicated
 on every rank.
 Supersedes ADR-0056's closed-ring ``_attention_allreduce_mlo.py``. Both
 modules stay on disk during the transition; this one runs on the
 hardware's actual open-mesh wiring (no closed-ring SFR install required).
 Imported by ``milestone_gqa_llama70b`` (after the bench's Phase 2 switches
 its imports) and invoked through ``torch.launch(...)`` — not through
 ``dist.all_reduce(...)``. See ADR-0056 Context for why this kernel is not
 backend-dispatched via ADR-0050's algorithm-module contract.
 """
 from __future__ import annotations
 from kernbench.common.pe_commands import TensorHandle
 def _view(handle: TensorHandle, new_shape: tuple[int, ...]) -> TensorHandle:
    """Reshape — metadata only, no command emitted (cf. ``tl.trans``)."""
    return TensorHandle(
        id=handle.id,
        addr=handle.addr,
        shape=new_shape,
        dtype=handle.dtype,
        nbytes=handle.nbytes,
        data=handle.data,
        space=handle.space,
        pinned=handle.pinned,
    )
 def attention_mesh_mlo_kernel(
    q_ptr: int,
    k_ptr: int,
    v_ptr: int,
    o_ptr: int,
    S_q: int,
    S_kv_per_rank: int,
    h_q: int,
    h_kv: int,
    d_head: int,
    n_ranks: int,
    rank_axis: int = 0,
    *,
    tl,
 ) -> None:
    """Mesh-native bidirectional AllReduce-mlo — see module docstring.
    ``rank_axis`` selects which program-id dimension carries the ring rank:
        0 — single_user_* panels: rank == tl.program_id(axis=0) (PE id in cube).
        1 — multi_user_* panels: ring is at the cube level. Only PE 0 in each
            cube participates; the other 7 hold KV replicas but stay silent.
    """
    # For multi_user (rank_axis=1) only PE 0 in each cube runs the ring.
    if rank_axis != 0 and tl.program_id(axis=0) != 0:
        return
    rank = tl.program_id(axis=rank_axis)
    has_E = rank < n_ranks - 1
    has_W = rank > 0
    # Q is replicated on every rank — loaded once.
    Q = tl.load(q_ptr, shape=(S_q, h_q * d_head), dtype="f16")
    # Local KV chunk. KV is sequence-sharded and stays put on this rank for
    # the entire fan-out — distinguishing decode from prefill (ADR-0059 D3)
    # where KV circulates.
    K = tl.load(k_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")
    V = tl.load(v_ptr, shape=(S_kv_per_rank, h_kv, d_head), dtype="f16")
    # ── One-shot local partial attention ──────────────────────────
    K_2d_T = _view(K, (h_q * d_head, S_kv_per_rank))
    V_2d = _view(V, (S_kv_per_rank, h_q * d_head))
    scores = tl.dot(Q, K_2d_T)
    m = tl.max(scores, axis=-1)
    P = tl.softmax(scores, axis=-1)
    scores_centered = scores - m
    exp_scores = tl.exp(scores_centered)
    ell = tl.sum(exp_scores, axis=-1)
    o = tl.dot(P, V_2d)
    # Seed bidirectional waves with own triplet (step-1 send).
    to_send_east_m: TensorHandle | None = m
    to_send_east_ell: TensorHandle | None = ell
    to_send_east_o: TensorHandle | None = o
    to_send_west_m: TensorHandle | None = m
    to_send_west_ell: TensorHandle | None = ell
    to_send_west_o: TensorHandle | None = o
    # Bidirectional fan-out of (m, ℓ, o) triplets — n_ranks - 1 steps.
    for step in range(1, n_ranks):
        # Send eastbound triplet (own at step 1; forwarded at later steps).
        if has_E and to_send_east_m is not None:
            tl.send(dir="E", src=to_send_east_m)
            tl.send(dir="E", src=to_send_east_ell)
            tl.send(dir="E", src=to_send_east_o)
        # Send westbound triplet.
        if has_W and to_send_west_m is not None:
            tl.send(dir="W", src=to_send_west_m)
            tl.send(dir="W", src=to_send_west_ell)
            tl.send(dir="W", src=to_send_west_o)
        # Receive eastbound triplet from W (originated at rank - step).
        m_from_W: TensorHandle | None = None
        ell_from_W: TensorHandle | None = None
        o_from_W: TensorHandle | None = None
        if has_W and (rank - step) >= 0:
            m_from_W = tl.recv(dir="W", shape=m.shape, dtype="f16")
            ell_from_W = tl.recv(dir="W", shape=ell.shape, dtype="f16")
            o_from_W = tl.recv(dir="W", shape=o.shape, dtype="f16")
            m_combined = tl.maximum(m, m_from_W)
            scale_old = tl.exp(m - m_combined)
            scale_new = tl.exp(m_from_W - m_combined)
            ell = ell * scale_old + ell_from_W * scale_new
            o = o * scale_old + o_from_W * scale_new
            m = m_combined
        # Receive westbound triplet from E (originated at rank + step).
        m_from_E: TensorHandle | None = None
        ell_from_E: TensorHandle | None = None
        o_from_E: TensorHandle | None = None
        if has_E and (rank + step) < n_ranks:
            m_from_E = tl.recv(dir="E", shape=m.shape, dtype="f16")
            ell_from_E = tl.recv(dir="E", shape=ell.shape, dtype="f16")
            o_from_E = tl.recv(dir="E", shape=o.shape, dtype="f16")
            m_combined = tl.maximum(m, m_from_E)
            scale_old = tl.exp(m - m_combined)
            scale_new = tl.exp(m_from_E - m_combined)
            ell = ell * scale_old + ell_from_E * scale_new
            o = o * scale_old + o_from_E * scale_new
            m = m_combined
        # Forward the original received triplet (not the merged running state)
        # so neighbors get the original wave. ``None`` propagates if nothing
        # arrived this step.
        to_send_east_m = m_from_W
        to_send_east_ell = ell_from_W
        to_send_east_o = o_from_W
        to_send_west_m = m_from_E
        to_send_west_ell = ell_from_E
        to_send_west_o = o_from_E
    # Final normalize: O := o / ℓ.
    O_final = o / ell
    tl.store(o_ptr, O_final)
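
The `(m, ℓ, o)` merge used in the fan-out loop above can be illustrated numerically for a single query row. This is a minimal sketch in plain Python, not the kernel itself: the helpers `partial`, `merge`, and `finalize` are hypothetical names, and the sketch assumes the partial output `o` is accumulated unnormalized (exp-weighted values), which is what a final `o / ℓ` presumes.

```python
import math


def partial(scores, values):
    """Return (m, ell, o) for one KV shard; o is left unnormalized."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    ell = sum(weights)
    dim = len(values[0])
    o = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, ell, o


def merge(a, b):
    """Combine two triplets by rescaling both to the shared running max."""
    m_a, ell_a, o_a = a
    m_b, ell_b, o_b = b
    m = max(m_a, m_b)
    scale_a = math.exp(m_a - m)
    scale_b = math.exp(m_b - m)
    ell = ell_a * scale_a + ell_b * scale_b
    o = [x * scale_a + y * scale_b for x, y in zip(o_a, o_b)]
    return m, ell, o


def finalize(triplet):
    """Final normalize: O = o / ell."""
    _, ell, o = triplet
    return [x / ell for x in o]


if __name__ == "__main__":
    scores = [0.5, -1.0, 2.0, 3.5, 0.0, -2.5]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0],
              [-1.0, 3.0], [0.5, 0.5], [4.0, -4.0]]
    full = finalize(partial(scores, values))
    shards = [partial(scores[i:i + 2], values[i:i + 2]) for i in (0, 2, 4)]
    merged = finalize(merge(merge(shards[0], shards[1]), shards[2]))
    print(full, merged)
```

Because the merge is associative and commutative up to floating-point rounding, the order in which triplets arrive from east and west should not change the final result.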
@@ -0,0 +1,568 @@
 """milestone-1h-gemm bench: GEMM evaluation harness (sweep + figures).
 Self-contained milestone bench (ADR-0054). Holds the shape×variant sweep
 and the figure renderers; the ``run(torch)`` entry at the bottom runs the
 sweep (or reuses the committed JSON when ``MILESTONE_FAST=1``) and writes
 every figure into ``benches/1H_milestone_output/gemm/``.
 This is the single home for the GEMM eval logic: the figure tests import a
 thin re-export shim (``tests/gemm/_gemm_plot_helpers.py``), as does the
 ``scripts/gemm_sweep.py`` wrapper.
 The sweep drives ``matmul-composite`` across shapes×variants through the
 same ``run_bench`` path the CLI uses, harvests ``result.engine.op_log``,
 and writes the sweep JSON. The renderers read that JSON and emit matplotlib
 PNGs. No simulation in the renderers — they are fast.
 Chart set (mirrors the GEMM MAC slides in scripts/build_overview_slides.py):
  - stage breakdown (load_ref operand staging)
  - MAC utilization — measured (load_ref)
  - MAC utilization — theoretical vs measured (load_ref)
 """
 from __future__ import annotations
 import json
 import os
 import sys
 import time
 from pathlib import Path
 from kernbench.benches.registry import bench
 from kernbench.policy.placement.dp import DPPolicy
 ROOT = Path(__file__).resolve().parents[3]
 DEFAULT_SWEEP_JSON = ROOT / "docs" / "diagrams" / "gemm_sweep.json"
 DEFAULT_PLOTS_DIR = ROOT / "docs" / "diagrams" / "gemm_plots"
 _OUTPUT_DIR = Path(__file__).resolve().parent / "1H_milestone_output" / "gemm"
 # ── sweep configuration ────────────────────────────────────────────────
 # Default sweep covering under-tile, single-tile, multi-tile, and asymmetric
 # regimes. Each entry is "MxKxN" or a single int (square M=K=N).
 # Override via env: SWEEP_SHAPES="16,32,16x2048x16,..."
 DEFAULT_SHAPES = [
    "32x32x32",       # 1 tile, K=32 < TILE_K=64 → under-tile in K
    "32x64x32",       # 1 tile, exact single-tile fit
    "32x128x32",      # 2 tiles, aligned
    "32x128x128",     # 8 tiles, aligned
    "32x3072x32",     # 48 tiles, all K-axis (tall-skinny)
    "8x128x128",      # 8 tiles, but M=8 < TILE_M=32 → MAC array under-fed
    "128x8x128",      # 16 tiles, but K=8 < TILE_K=64 → MAC array under-fed
    "512",            # 2048 tiles, fully aligned — "well-pipelined" reference
 ]
 # Operand-staging variants exercised per shape.
 VARIANTS = ["ref_ref", "load_ref", "load_load"]
 # Engines whose timings we collect (component_id suffix match).
 ENGINES = ["pe_dma", "pe_fetch_store", "pe_gemm", "pe_math"]
 # Per-stage breakdown labels (StageType enum names from pe_types.py).
 STAGES = ["DMA_READ", "DMA_WRITE", "FETCH", "STORE", "GEMM", "MATH"]
 # Scheduler tile sizes (mirror of PeSchedulerComponent.TILE_M/K/N).
 TILE_M, TILE_K, TILE_N = 32, 64, 32
 def _ceil(a: int, b: int) -> int:
    return (a + b - 1) // b
 def _engine_wall_ns(records, suffix: str) -> float:
    """Wall-clock interval the engine was active (union of overlapping ops)."""
    intervals = [(r.t_start, r.t_end) for r in records
                 if r.component_id.endswith("." + suffix)]
    if not intervals:
        return 0.0
    intervals.sort()
    merged_end = intervals[0][1]
    merged_start = intervals[0][0]
    total = 0.0
    for s, e in intervals[1:]:
        if s <= merged_end:
            merged_end = max(merged_end, e)
        else:
            total += merged_end - merged_start
            merged_start, merged_end = s, e
    total += merged_end - merged_start
    return total
 def _engine_occupancy_ns(records, suffix: str) -> float:
    return sum(r.t_end - r.t_start for r in records
               if r.component_id.endswith("." + suffix))
 def _engine_count(records, suffix: str) -> int:
    return sum(1 for r in records if r.component_id.endswith("." + suffix))
 def _stage_occupancy_ns(records, stage_type: str) -> float:
    return sum(
        r.t_end - r.t_start
        for r in records
        if r.params.get("stage_type") == stage_type
    )
 def _stage_wall_ns(records, stage_type: str) -> float:
    """Interval-union wall-clock for records whose stage_type matches."""
    intervals = sorted(
        (r.t_start, r.t_end) for r in records
        if r.params.get("stage_type") == stage_type
    )
    if not intervals:
        return 0.0
    total = 0.0
    cs, ce = intervals[0]
    for s, e in intervals[1:]:
        if s <= ce:
            ce = max(ce, e)
        else:
            total += ce - cs
            cs, ce = s, e
    total += ce - cs
    return total
 def _stage_count(records, stage_type: str) -> int:
    return sum(1 for r in records if r.params.get("stage_type") == stage_type)
 def _run_one(M: int, K: int, N: int, topology: str, variant: str = "ref_ref") -> dict:
    os.environ["MATMUL_M"] = str(M)
    os.environ["MATMUL_K"] = str(K)
    os.environ["MATMUL_N"] = str(N)
    os.environ["MATMUL_VARIANT"] = variant
    # Late imports so env vars are read by matmul_composite at module load.
    # Force re-import to pick up new env values.
    for mod_name in [m for m in list(sys.modules)
                     if m.startswith("kernbench.benches.matmul_composite")]:
        del sys.modules[mod_name]
    from kernbench.benches.registry import resolve as resolve_bench
    from kernbench.runtime_api.bench_runner import run_bench
    from kernbench.runtime_api.types import resolve_device
    from kernbench.sim_engine.engine import GraphEngine
    from kernbench.topology.builder import resolve_topology
    topo = resolve_topology(topology)
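    # Named ``bench_fn`` so it does not shadow the module-level ``bench``
    # decorator imported from ``kernbench.benches.registry``.
    bench_fn = resolve_bench("matmul-composite").run
    device = resolve_device(None)
    t0 = time.time()
    result = run_bench(
        topology=topo, bench_fn=bench_fn, device=device,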
        engine_factory=lambda t, d: GraphEngine(
            getattr(t, "topology_obj", t), enable_data=True,
        ),
    )
    wall = time.time() - t0
    op_log = result.engine.op_log
    if not result.completion.ok:
        raise RuntimeError(f"bench failed at M={M},K={K},N={N}: {result.completion}")
    # Bytes touched at f16 (2 B): full A + full B + full out (each operand
    # streamed once through HBM by the composite plan).
    bytes_total = (M * K + K * N + M * N) * 2
    row = {
        "M": M, "K": K, "N": N,
        "variant": variant,
        "flops": 2 * M * K * N,
        "bytes_hbm": bytes_total,
        "arith_intensity": (2 * M * K * N) / bytes_total,  # flops/byte
        "tile_count_expected": _ceil(M, TILE_M) * _ceil(N, TILE_N) * _ceil(K, TILE_K),
        "sim_wall_clock_s": round(wall, 3),
        "engines": {},
    }
    for eng in ENGINES:
        row["engines"][eng] = {
            "occupancy_ns": _engine_occupancy_ns(op_log, eng),
            "wall_ns":      _engine_wall_ns(op_log, eng),
            "record_count": _engine_count(op_log, eng),
        }
    row["stages"] = {}
    for stage in STAGES:
        row["stages"][stage] = {
            "occupancy_ns": _stage_occupancy_ns(op_log, stage),
            "wall_ns":      _stage_wall_ns(op_log, stage),
            "record_count": _stage_count(op_log, stage),
        }
    # Kernel-window wall-clock = max t_end - min t_start over PE engine records.
    pe_records = [r for r in op_log
                  if any(r.component_id.endswith("." + e) for e in ENGINES)]
    if pe_records:
        row["pe_window_ns"] = max(r.t_end for r in pe_records) \
                              - min(r.t_start for r in pe_records)
    else:
        row["pe_window_ns"] = 0.0
    stage_records = [r for r in op_log
                     if r.params.get("stage_type") in STAGES]
    if stage_records:
        row["composite_window_ns"] = max(r.t_end for r in stage_records) \
                                     - min(r.t_start for r in stage_records)
    else:
        row["composite_window_ns"] = 0.0
    return row
 def _parse_shapes(raw) -> list[tuple[int, int, int]]:
    shapes: list[tuple[int, int, int]] = []
    for s in raw:
        s = s.strip()
        if not s:
            continue
        if "x" in s.lower():
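            parts = s.lower().split("x")
            # Reject malformed entries such as "16x16" or "1x2x3x4" up front.
            if len(parts) != 3:
                raise ValueError(
                    f"shape {s!r} must be 'MxKxN' or a single int"
                )
            shapes.append((int(parts[0]), int(parts[1]), int(parts[2])))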
        else:
            v = int(s)
            shapes.append((v, v, v))
    return shapes
 def run_sweep(out_json: Path | str = DEFAULT_SWEEP_JSON) -> Path:
    """Drive matmul-composite across shapes×variants; write the sweep JSON.
    Honors ``SWEEP_SHAPES`` / ``SWEEP_TOPOLOGY`` env overrides (same as the
    historical ``scripts/gemm_sweep.py``). Returns the JSON path written.
    """
    shapes_env = os.environ.get("SWEEP_SHAPES")
    raw = (shapes_env.split(",") if shapes_env else DEFAULT_SHAPES)
    shapes = _parse_shapes(raw)
    topology = os.environ.get("SWEEP_TOPOLOGY", "topology.yaml")
    rows = []
    for M, K, N in shapes:
        for variant in VARIANTS:
            print(f"[sweep] M={M} K={K} N={N} variant={variant} ...", flush=True)
            row = _run_one(M, K, N, topology, variant=variant)
            rows.append(row)
            eng_dma = row["engines"]["pe_dma"]
            eng_gem = row["engines"]["pe_gemm"]
            print(f"   tiles={row['tile_count_expected']:>6}  "
                  f"pe_window={row['pe_window_ns']:8.1f}ns  "
                  f"dma_occ={eng_dma['occupancy_ns']:9.1f}  "
                  f"gemm_occ={eng_gem['occupancy_ns']:8.1f}  "
                  f"(sim {row['sim_wall_clock_s']:.1f}s)")
    out_json = Path(out_json)
    out_json.parent.mkdir(parents=True, exist_ok=True)
    out_json.write_text(json.dumps({
        "tile_sizes": {"M": TILE_M, "K": TILE_K, "N": TILE_N},
        "engines": ENGINES,
        "stages": STAGES,
        "variants": VARIANTS,
        "rows": rows,
    }, indent=2))
    print(f"\n[sweep] wrote {out_json}")
    return out_json
 # ── figure rendering ───────────────────────────────────────────────────
 # Shapes excluded from the figures (mirrors build_overview_slides).
 EXCLUDED_SHAPES = {(512, 512, 512)}
 # Stage bars shown (raw op_log stage_type keys) + display names + colors.
 STAGE_KEYS = ["DMA_READ", "FETCH", "GEMM", "DMA_WRITE"]
 STAGE_DISPLAY = {
    "DMA_READ":  "DMA in",
    "FETCH":     "Fetch",
    "GEMM":      "GEMM",
    "DMA_WRITE": "DMA out",
 }
 STAGE_COLORS = {
    "DMA_READ":  "#3B82F6",
    "FETCH":     "#10B981",
    "GEMM":      "#F59E0B",
    "DMA_WRITE": "#A855F7",
 }
 # MAC-utilization model constants (mirror build_overview_slides).
 _HBM_GBS = 256.0
 _BPE = 2
 _T_STAGE = 16.0
 _D_STAGES = 3
 _PLOT_VARIANT = "load_ref"
 def _load_sweep_data(sweep_json: Path | str = DEFAULT_SWEEP_JSON) -> dict:
    sweep_json = Path(sweep_json)
    if not sweep_json.exists():
        return {"rows": []}
    data = json.loads(sweep_json.read_text())
    data["rows"] = [
        r for r in data.get("rows", [])
        if (r["M"], r["K"], r["N"]) not in EXCLUDED_SHAPES
    ]
    return data
 def _shape_label(r: dict) -> str:
    if r["M"] == r["K"] == r["N"]:
        return f"M=K=N={r['M']}"
    return f"M={r['M']} K={r['K']} N={r['N']}"
 def _under_tile(M, K, N, tile_M, tile_K, tile_N) -> bool:
    return M < tile_M or K < tile_K or N < tile_N
 def _xtick_labels(shape_labels, tile_counts, flagged) -> list[str]:
    out = []
    for lbl, tc, fl in zip(shape_labels, tile_counts, flagged):
        s = f"{lbl}\n({tc} tiles)"
        if fl:
            s += " *"
        out.append(s)
    return out
 def _grouped_bar_png(
    out_name: str, *, out_dir: Path, title: str, subtitle: str | None,
    shape_labels, tile_counts, flagged, series: dict, colors: dict,
    y_label: str, threshold: float | None = None, footnote: str | None = None,
 ) -> str:
    """Render one grouped-bar chart to out_dir/out_name; return the path."""
    import matplotlib.pyplot as plt
    import numpy as np
    n_groups = len(shape_labels)
    n_series = max(1, len(series))
    x = np.arange(n_groups)
    width = 0.8 / n_series
    fig, ax = plt.subplots(figsize=(11, 6))
    for i, (name, vals) in enumerate(series.items()):
        offset = (i - (n_series - 1) / 2) * width
        ax.bar(x + offset, vals, width, label=name, color=colors.get(name))
    ax.set_xticks(x)
    ax.set_xticklabels(
        _xtick_labels(shape_labels, tile_counts, flagged), fontsize=8,
    )
    ax.set_ylabel(y_label)
    ax.set_title(title, fontsize=13, fontweight="bold")
    if subtitle:
        ax.text(0.5, 1.01, subtitle, transform=ax.transAxes, ha="center",
                va="bottom", fontsize=8, color="#475569")
    if threshold is not None:
        ax.axhline(threshold, ls="--", color="gray", lw=1.0)
    ax.legend(fontsize=8, loc="upper right")
    ax.grid(True, axis="y", alpha=0.3)
    caption = "* = under-tile shape (M<TILE_M, K<TILE_K, or N<TILE_N)"
    if footnote:
        caption = footnote + "\n" + caption
    fig.text(0.5, 0.01, caption, ha="center", fontsize=7, color="gray",
             wrap=True)
    fig.tight_layout(rect=(0, 0.05, 1, 1))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out = out_dir / out_name
    fig.savefig(out, dpi=120)
    plt.close(fig)
    return str(out)
 def emit_stage_breakdown(
    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
    out_dir: Path | str = DEFAULT_PLOTS_DIR,
 ) -> str | None:
    """Per-stage engine wall-clock per shape (load_ref operand staging)."""
    data = _load_sweep_data(sweep_json)
    rows = [r for r in data["rows"] if r.get("variant") == _PLOT_VARIANT]
    if not rows:
        return None
    tile = data["tile_sizes"]
    shape_labels = [_shape_label(r) for r in rows]
    flagged = [_under_tile(r["M"], r["K"], r["N"], tile["M"], tile["K"], tile["N"])
               for r in rows]
    tile_counts = [r["tile_count_expected"] for r in rows]
    series = {
        STAGE_DISPLAY[s]: [r.get("stages", {}).get(s, {}).get("wall_ns", 0.0)
                           for r in rows]
        for s in STAGE_KEYS
    }
    colors = {STAGE_DISPLAY[s]: STAGE_COLORS[s] for s in STAGE_KEYS}
    return _grouped_bar_png(
        "gemm_stage_breakdown.png", out_dir=Path(out_dir),
        title="GEMM stage breakdown",
        subtitle=(f"Per-stage engine wall-clock (DMA in / Fetch / GEMM / "
                  f"DMA out), {_PLOT_VARIANT} staging. "
                  f"Tile {tile['M']}x{tile['K']}x{tile['N']}."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="ns",
        footnote="Bars = engine wall-clock interval (merged overlaps).",
    )
 def emit_mac_utilization_measured(
    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
    out_dir: Path | str = DEFAULT_PLOTS_DIR,
 ) -> str | None:
    """GEMM util % and useful pipeline-eff % (analytical model, load_ref)."""
    data = _load_sweep_data(sweep_json)
    rows = data["rows"]
    if not rows:
        return None
    tile = data["tile_sizes"]
    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
    head_ns = (_D_STAGES - 1) * _T_STAGE
    by_shape = {(r["M"], r["K"], r["N"]): r
                for r in rows if r["variant"] == _PLOT_VARIANT}
    shapes = list(by_shape)
    if not shapes:
        return None
    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
    gemm_util, useful_eff = [], []
    for k in shapes:
        r = by_shape[k]
        M, K, N = r["M"], r["K"], r["N"]
        useful = 2 * M * K * N
        tiles = r["tile_count_expected"]
        gu = useful / (tile_flops * tiles) * 100
        gemm_util.append(gu)
        m_tiles = (M + TILE_M - 1) // TILE_M
        n_tiles = (N + TILE_N - 1) // TILE_N
        n_mn = m_tiles * n_tiles
        compute_total = tiles * _T_STAGE
        wall = head_ns + tiles * _T_STAGE + max(0, n_mn - 1) * dma_w_per_pair
        ueff = (compute_total * (gu / 100.0) / wall) * 100 if wall > 0 else 0.0
        useful_eff.append(ueff)
    series = {"GEMM util %": gemm_util, "Useful eff %": useful_eff}
    colors = {"GEMM util %": "#10B981", "Useful eff %": "#F59E0B"}
    return _grouped_bar_png(
        "gemm_mac_utilization_measured.png", out_dir=Path(out_dir),
        title="GEMM MAC utilization — load_ref",
        subtitle=("GEMM util = useful FLOPs / (tile FLOPs x tiles); "
                  "Useful eff = GEMM util x ideal pipeline efficiency."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="%", threshold=100.0,
        footnote="Theoretical ideal-pipeline model (not simulator data).",
    )
 def emit_mac_utilization_theoretical_vs_measured(
    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
    out_dir: Path | str = DEFAULT_PLOTS_DIR,
 ) -> str | None:
    """Theoretical vs simulator-measured GEMM util / useful eff (load_ref)."""
    data = _load_sweep_data(sweep_json)
    rows = data["rows"]
    if not rows:
        return None
    tile = data["tile_sizes"]
    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
    head_ns = (_D_STAGES - 1) * _T_STAGE
    peak_per_ns = tile_flops / _T_STAGE
    by_shape = {(r["M"], r["K"], r["N"]): r
                for r in rows if r["variant"] == _PLOT_VARIANT}
    shapes = list(by_shape)
    if not shapes:
        return None
    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
    gu_t, gu_m, eff_t, eff_m = [], [], [], []
    for k in shapes:
        r = by_shape[k]
        M, K, N = r["M"], r["K"], r["N"]
        useful = 2 * M * K * N
        tiles = r["tile_count_expected"]
        gut = useful / (tile_flops * tiles)
        gu_t.append(gut * 100)
        rec = r.get("stages", {}).get("GEMM", {}).get("record_count", 0) or tiles
        gu_m.append((useful / (tile_flops * rec) * 100) if rec else 0.0)
        m_tiles = (M + TILE_M - 1) // TILE_M
        n_tiles = (N + TILE_N - 1) // TILE_N
        n_mn = m_tiles * n_tiles
        compute_total = tiles * _T_STAGE
        wall_t = head_ns + compute_total + max(0, n_mn - 1) * dma_w_per_pair
        eff_t.append((compute_total * gut / wall_t * 100) if wall_t > 0 else 0.0)
        cw = r.get("composite_window_ns", 0.0) or 0.0
        eff_m.append((useful / cw / peak_per_ns * 100) if cw > 0 else 0.0)
    series = {
        "GEMM util % (theoretical)": gu_t,
        "GEMM util % (measured)":    gu_m,
        "Theoretical eff %":         eff_t,
        "Measured eff %":            eff_m,
    }
    colors = {
        "GEMM util % (theoretical)": "#10B981",
        "GEMM util % (measured)":    "#6EE7B7",
        "Theoretical eff %":         "#F59E0B",
        "Measured eff %":            "#3B82F6",
    }
    return _grouped_bar_png(
        "gemm_mac_utilization_theoretical_vs_measured.png", out_dir=Path(out_dir),
        title="GEMM MAC utilization — theoretical vs measured (load_ref)",
        subtitle=("theoretical model vs simulator op_log; agreement "
                  "validates the analytical pipeline model."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="%", threshold=100.0,
    )
 def emit_all_gemm_plots(
    sweep_json: Path | str = DEFAULT_SWEEP_JSON,
    out_dir: Path | str = DEFAULT_PLOTS_DIR,
 ) -> list[str]:
    """Render every GEMM figure that has data; return the paths written."""
    paths = []
    for fn in (emit_stage_breakdown,
               emit_mac_utilization_measured,
               emit_mac_utilization_theoretical_vs_measured):
        p = fn(sweep_json, out_dir)
        if p:
            paths.append(p)
    return paths
 # ── bench entry ────────────────────────────────────────────────────────
@bench(
    name="milestone-1h-gemm",
    description="1H milestone: regenerate all GEMM results + figures.",
 )
 def run(torch) -> None:
    """Run the GEMM sweep (or reuse committed JSON) and render every figure.
    ``MILESTONE_FAST=1`` reuses the committed ``DEFAULT_SWEEP_JSON`` (seconds);
    otherwise the full sweep runs into ``out_dir/gemm_sweep.json`` (minutes).
    The sweep drives its own engines, so a sentinel tensor is submitted at the
    end to satisfy the run_bench contract (ADR-0045 D4).
    """
    _OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
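    # Treat unset, empty, "0", and "false" as disabled so MILESTONE_FAST=0
    # does not silently select the fast path.
    fast = os.environ.get("MILESTONE_FAST", "").strip().lower() not in (
        "", "0", "false",
    )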
    if fast:
        sweep_json = DEFAULT_SWEEP_JSON
    else:
        sweep_json = run_sweep(out_json=_OUTPUT_DIR / "gemm_sweep.json")
    paths = emit_all_gemm_plots(sweep_json=sweep_json, out_dir=_OUTPUT_DIR)
    print(f"  milestone-1h-gemm: {len(paths)} figures -> {_OUTPUT_DIR} "
          f"(fast={fast})")
    torch.zeros(
        (1, 1), dtype="f16",
        dp=DPPolicy(cube="row_wise", pe="replicate", num_cubes=1, num_pes=1),
        name="milestone_gemm_sentinel",
    )
@@ -609,6 +609,7 @@ class RuntimeContext:
        kernel_fn: Any,
        *args: Any,
        _defer_wait: bool = False,
        _auto_dim_remap: bool = True,
        **kwargs: Any,
    ) -> RequestHandle:
        """Register and launch a kernel (like a fused torch op).
@@ -700,21 +701,36 @@ class RuntimeContext:
                return t.shape
            # ADR-0026: DPPolicy no longer crosses SIP boundaries; cube + PE
            # are the only axes that shrink the local shape.
            # Mirror the tensor allocator's precedence (context.py L471-484):
            # DPPolicy.num_cubes overrides the topology's cube count when set.
            # Without this, multi_user panels at validation scale
            # (DPPolicy.num_cubes=4) get sharded as if the topology's full
            # cube count (16) applied — see test_launch_dim_translation.py.
            if dp.cube != "replicate":
                eff_num_cubes = (
                    dp.num_cubes if dp.num_cubes is not None else self._num_cubes
                )
                if dp.cube == "column_wise":
-                    K = K // self._num_cubes
+                    K = K // eff_num_cubes
                elif dp.cube == "row_wise":
-                    M = M // self._num_cubes
+                    M = M // eff_num_cubes
            if len(t.shape) < 2:
                return (K,)
            return (M, K)
        # Auto-dim-remap (opt-out via _auto_dim_remap=False). Legacy
        # kernels (e.g. va_offset bench) pass global dims as scalars and
        # rely on launch to rewrite them to local. Mesh attention kernels
        # already receive cube-local dims (S_kv_per_rank, d_head, …) and
        # opt out — the remap would otherwise collide d_head=64 with K's
        # global M=64 and rewrite d_head. See test_launch_dim_translation.py.
        dim_map: dict[int, int] = {}  # global_dim → local_dim
-        for t in tensor_args:
-            local = _compute_local_shape(t)
-            for g, l in zip(t.shape if len(t.shape) >= 2 else (1, t.shape[0]), local if len(local) >= 2 else (1, local[0])):
-                if g != l:
-                    dim_map[g] = l
+        if _auto_dim_remap:
+            for t in tensor_args:
+                local = _compute_local_shape(t)
+                for g, l in zip(t.shape if len(t.shape) >= 2 else (1, t.shape[0]), local if len(local) >= 2 else (1, local[0])):
+                    if g != l:
+                        dim_map[g] = l
        # Per-SIP kernel launch: each SIP gets TensorArgs with local va_base
        last_handle = None
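The shape arithmetic in this hunk can be illustrated in isolation. The sketch below is a plain-Python approximation, not the `RuntimeContext` code: `local_shape` and `build_dim_map` are hypothetical helper names, and the 16-cube topology and 4-cube policy are the values quoted in the comments above. It shows why the `DPPolicy.num_cubes` override matters and how the scalar remap can collide with `d_head=64`.

```python
from __future__ import annotations


def local_shape(shape, cube_policy, topo_num_cubes, dp_num_cubes=None):
    """Shrink a 2-D global (M, K) shape to its per-cube local shape."""
    M, K = shape
    if cube_policy != "replicate":
        # Assumed precedence: a policy-level cube count wins when it is set.
        eff = dp_num_cubes if dp_num_cubes is not None else topo_num_cubes
        if cube_policy == "column_wise":
            K //= eff
        elif cube_policy == "row_wise":
            M //= eff
    return (M, K)


def build_dim_map(shapes_and_locals, auto_dim_remap=True):
    """Map each global dim to its local dim wherever the two differ."""
    dim_map = {}
    if auto_dim_remap:
        for global_shape, local in shapes_and_locals:
            for g, l in zip(global_shape, local):
                if g != l:
                    dim_map[g] = l
    return dim_map


# K tensor at validation scale: 16 rows per rank * 4 ranks, d_head = 64.
k_global = (64, 64)
k_local = local_shape(k_global, "row_wise", topo_num_cubes=16, dp_num_cubes=4)
stale = local_shape(k_global, "row_wise", topo_num_cubes=16)

# With the remap on, any scalar argument equal to 64 (including d_head)
# would be rewritten to 16; opting out leaves the map empty.
remap = build_dim_map([(k_global, k_local)])
no_remap = build_dim_map([(k_global, k_local)], auto_dim_remap=False)
```

Here `stale` is what the pre-change division by the topology's cube count would give, and `remap` is the mapping that motivates `_auto_dim_remap=False` for kernels that already receive cube-local dims.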
@@ -25,6 +25,37 @@ class DataExecutor:
    def __init__(self, op_log: list[OpRecord], store: MemoryStore) -> None:
        self._op_log = op_log
        self.store = store
        # Per-slot time-ordered shape-keyed history. Populated on every
        # ipcq_copy WRITE; consulted on reads that find a shape-mismatched
        # value in MemoryStore (the slot was wrapped by a later inbound
        # before this read's Phase 2 turn). Required because Phase 1 cannot
        # snapshot math-output sources at outbound time (math executes only
        # in Phase 2), so token.data is None and slot wraps lose the recv-
        # time value. See test_attention_mesh_decode_diag (ADR-0059 mesh).
        self._slot_history: dict[tuple[str, int], list[tuple[float, Any]]] = {}
    def _resolve_read(
        self, space: str, addr: int,
        shape: tuple[int, ...] | None, dtype: str | None,
        t_at_or_before: float,
    ) -> Any:
        """Read (space, addr) with expected shape. On KeyError or shape
        mismatch in MemoryStore, fall back to ``_slot_history`` for the
        most recent shape-matching entry with t_write <= t_at_or_before.
        Returns None when no match is found."""
        try:
            return self.store.read(space, addr, shape=shape, dtype=dtype)
        except (KeyError, ValueError):
            pass
        hist = self._slot_history.get((space, addr))
        if hist is None:
            return None
        for t_w, val in reversed(hist):
            if t_w > t_at_or_before:
                continue
            if shape is None or getattr(val, "shape", None) == shape:
                return val
        return None
    # Ordering priority within the same t_start: memory copies must run
    # before math/gemm so that slot data is populated before a consumer
@@ -87,14 +118,23 @@ class DataExecutor:
            # only get populated by Phase 2's math replay).
            data = p.get("snapshot")
            if data is None:
-                try:
-                    data = self.store.read(
-                        src_space, src_addr,
-                        shape=p.get("shape"), dtype=p.get("dtype"),
-                    )
-                except KeyError:
+                data = self._resolve_read(
+                    src_space, src_addr,
+                    p.get("shape"), p.get("dtype"), op.t_start,
+                )
+                if data is None:
                    return
            self.store.write(dst_space, dst_addr, data)
            # Record this write in slot history so a later forwarded read
            # at src=dst_addr (a different ipcq_copy whose src is this slot)
            # can recover by shape even after the slot has been wrapped.
            if op.op_name == "ipcq_copy":
                self._slot_history.setdefault(
                    (dst_space, dst_addr), [],
                ).append((
                    op.t_start,
                    data.copy() if hasattr(data, "copy") else data,
                ))
    def _execute_gemm(self, op: OpRecord) -> None:
        """Execute GEMM: out = a @ b."""
@@ -110,10 +150,16 @@ class DataExecutor:
        dtype_in = p.get("dtype_in", "f16")
        dtype_out = p.get("dtype_out", dtype_in)
-        a = self.store.read(src_a_space, p["src_a_addr"],
-                            shape=p.get("shape_a"), dtype=dtype_in)
-        b = self.store.read(src_b_space, p["src_b_addr"],
-                            shape=p.get("shape_b"), dtype=dtype_in)
+        a = self._resolve_read(src_a_space, p["src_a_addr"],
+                               p.get("shape_a"), dtype_in, op.t_start)
+        if a is None:
+            a = self.store.read(src_a_space, p["src_a_addr"],
+                                shape=p.get("shape_a"), dtype=dtype_in)
+        b = self._resolve_read(src_b_space, p["src_b_addr"],
+                               p.get("shape_b"), dtype_in, op.t_start)
+        if b is None:
+            b = self.store.read(src_b_space, p["src_b_addr"],
+                                shape=p.get("shape_b"), dtype=dtype_in)
        # Compute in higher precision if specified
        dtype_acc = p.get("dtype_acc", "f32")
@@ -150,8 +196,11 @@ class DataExecutor:
        ):
            if snap is not None:
                inputs.append(snap)
-            else:
-                inputs.append(self.store.read(space, addr, shape=shape, dtype=idtype))
+                continue
+            resolved = self._resolve_read(space, addr, shape, idtype, op.t_start)
+            if resolved is None:
+                resolved = self.store.read(space, addr, shape=shape, dtype=idtype)
+            inputs.append(resolved)
        result = _compute_math(math_op, inputs, p.get("axis"))
        if result is not None:
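The slot-history fallback used by `_resolve_read` can be sketched on its own. This is a minimal illustration under stated assumptions, not the `DataExecutor` code: `Blob` and `SlotHistory` are hypothetical stand-ins (the real values are array-like objects held in `MemoryStore`), and only the lookup rule is modelled: newest entry first, skipping writes later than the read time and entries whose shape does not match.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Any


@dataclass(frozen=True)
class Blob:
    """Stand-in for an array value; only ``shape`` matters for the lookup."""
    shape: tuple[int, ...]
    tag: str

    def copy(self) -> "Blob":
        return replace(self)


class SlotHistory:
    """Per-slot, time-ordered record of writes, keyed by (space, addr)."""

    def __init__(self) -> None:
        self._hist: dict[tuple[str, int], list[tuple[float, Any]]] = {}

    def record(self, space: str, addr: int, t_write: float, value: Any) -> None:
        # Store a copy so later in-place changes cannot alter the history.
        self._hist.setdefault((space, addr), []).append((t_write, value.copy()))

    def lookup(self, space: str, addr: int, shape, t_at_or_before: float):
        # Walk newest-to-oldest; skip future writes and shape mismatches.
        for t_w, val in reversed(self._hist.get((space, addr), [])):
            if t_w > t_at_or_before:
                continue
            if shape is None or getattr(val, "shape", None) == shape:
                return val
        return None


hist = SlotHistory()
hist.record("tcm", 0x100, 1.0, Blob((1, 64), "round-0"))
hist.record("tcm", 0x100, 2.0, Blob((16, 64), "round-1"))  # slot wrapped

found = hist.lookup("tcm", 0x100, (1, 64), t_at_or_before=3.0)
too_early = hist.lookup("tcm", 0x100, (1, 64), t_at_or_before=0.5)
latest = hist.lookup("tcm", 0x100, None, t_at_or_before=3.0)
```

The second write models a ring round that reuses the slot with a different shape. A reader expecting the `(1, 64)` value still recovers the earlier entry, while a read timestamped before any write gets `None`.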
@@ -96,13 +96,20 @@ class OpLogger:
        # gets reused on the next ring round).
        if self._memory_store is not None:
            if op_kind == "math":
                handle_snaps = params.get("input_handle_data") or ()
                snaps: list[Any] = []
-                for addr, shape, space, idtype in zip(
+                for i, (addr, shape, space, idtype) in enumerate(zip(
                    params.get("input_addrs", []),
                    params.get("input_shapes", []),
                    params.get("input_spaces", []),
                    params.get("input_dtypes", []),
-                ):
+                )):
                    if i < len(handle_snaps) and handle_snaps[i] is not None:
                        carried = handle_snaps[i]
                        snaps.append(
                            carried.copy() if hasattr(carried, "copy") else carried
                        )
                        continue
                    try:
                        arr = self._memory_store.read(
                            space, addr, shape=shape, dtype=idtype,
@@ -111,6 +118,7 @@ class OpLogger:
                    except Exception:
                        snaps.append(None)
                params["input_snapshots"] = snaps
                params.pop("input_handle_data", None)
            elif op_name == "dma_write":
                # ADR-0027 fix: only snapshot HBM sources. TCM (PE scratch)
                # sources are repopulated by Phase 2 math/gemm replay —
@@ -222,6 +230,7 @@ def _extract_op_info(msg: Any) -> tuple[str, str, dict[str, Any]]:
            "input_shapes": [h.shape for h in msg.inputs],
            "input_spaces": [getattr(h, "space", "tcm") for h in msg.inputs],
            "input_dtypes": [h.dtype for h in msg.inputs],
            "input_handle_data": tuple(getattr(h, "data", None) for h in msg.inputs),
            "dst_addr": msg.out.addr,
            "dst_space": getattr(msg.out, "space", "tcm"),
            "shape_out": msg.out.shape,
@@ -0,0 +1,198 @@
 """End-to-end engine drives for the four GQA Llama-70B panels (sub-cycle 4c step 2).
 Mirrors the existing single_user_decode diag harness across all four panels
 of the milestone-gqa-llama70b sweep (ADR-0057):
  single_user_prefill   ring-K/V kernel, intracube PE ring  (8 PEs / 1 cube)
  single_user_decode    allreduce-mlo kernel, intracube PE ring
  multi_user_prefill    ring-K/V kernel, intercube multisip (4 cubes)
  multi_user_decode     allreduce-mlo kernel, intercube multisip
 Each test runs the panel through ``run_bench`` with ``enable_data=True``
 and asserts ``result.completion.ok``. Failures dump the engine's op_log
 tail and the exception, mirroring the decode-diag harness format.
 Validation-scale config matches ADR-0057 D4:
  S_q_prefill=16, S_kv_per_rank=16, h_q=h_kv=1, d_head=64
  n_ranks_single_user=8, n_ranks_multi_user=4
 """
 from __future__ import annotations
 import traceback
 from pathlib import Path
 import pytest
 from kernbench.benches._attention_mesh_kv import attention_mesh_kv_kernel
 from kernbench.benches._attention_mesh_mlo import attention_mesh_mlo_kernel
 from kernbench.ccl.install import load_ccl_config, resolve_algorithm_config
 from kernbench.ccl.sfr_config import (
    configure_sfr_intercube_multisip,
    configure_sfr_intracube_pe_ring,
 )
 from kernbench.policy.placement.dp import DPPolicy
 from kernbench.runtime_api.bench_runner import run_bench
 from kernbench.runtime_api.types import resolve_device
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 TOPOLOGY_PATH = Path(__file__).resolve().parents[2] / "topology.yaml"
 S_Q_PREFILL = 16
 S_Q_DECODE = 1
 S_KV_PER_RANK = 16
 H_Q = 1
 H_KV = 1
 D_HEAD = 64
 N_RANKS_SINGLE_USER = 8
 N_RANKS_MULTI_USER = 4
 DTYPE = "f16"
 # ── Helpers ──────────────────────────────────────────────────────
 def _engine_factory(t, d):
    return GraphEngine(getattr(t, "topology_obj", t), enable_data=True)
 def _run_panel(bench_fn):
    """Drive a panel through run_bench; return (exc, result, engine)."""
    topo = resolve_topology(str(TOPOLOGY_PATH))
    captured: dict = {"engine": None}
    def factory(t, d):
        eng = _engine_factory(t, d)
        captured["engine"] = eng
        return eng
    exc = None
    result = None
    try:
        result = run_bench(
            topology=topo, bench_fn=bench_fn,
            device=resolve_device(None), engine_factory=factory,
        )
    except BaseException as e:  # noqa: BLE001
        exc = e
    return exc, result, captured["engine"]
 def _assert_ok(name: str, exc, result, engine) -> None:
    if exc is not None:
        oplog_len = len(getattr(engine, "op_log", []) or []) if engine else 0
        print(f"\n========== {name} FAIL ==========")
        print(f"op_log records before crash: {oplog_len}")
        print(f"{type(exc).__name__}: {exc}")
        traceback.print_exception(type(exc), exc, exc.__traceback__)
        raise AssertionError(
            f"{name} failed at runtime: {exc}"
        ) from exc
    assert result is not None, f"{name}: no result"
    assert result.completion.ok, f"{name}: completion not ok — {result.completion}"
 # ── Panel bench fns ──────────────────────────────────────────────
 def _bench_fn_single_user_prefill(ctx):
    configure_sfr_intracube_pe_ring(
        ctx.engine, ctx.spec,
        resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
    )
    n = N_RANKS_SINGLE_USER
    dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=n)
    dp_kv = DPPolicy(cube="replicate", pe="row_wise", num_cubes=1, num_pes=n)
    q = ctx.zeros((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
    k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
    v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
    o = ctx.empty((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
    ctx.launch(
        "single_user_prefill_mesh", attention_mesh_kv_kernel,
        q, k, v, o,
        S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
    )
 def _bench_fn_single_user_decode(ctx):
    configure_sfr_intracube_pe_ring(
        ctx.engine, ctx.spec,
        resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
    )
    n = N_RANKS_SINGLE_USER
    dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=1, num_pes=n)
    dp_kv = DPPolicy(cube="replicate", pe="row_wise", num_cubes=1, num_pes=n)
    q = ctx.zeros((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
    k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
    v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
    o = ctx.empty((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
    ctx.launch(
        "single_user_decode_mesh", attention_mesh_mlo_kernel,
        q, k, v, o,
        S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
    )
 def _bench_fn_multi_user_prefill(ctx):
    configure_sfr_intercube_multisip(
        ctx.engine, ctx.spec,
        resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
    )
    n = N_RANKS_MULTI_USER
    dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=n, num_pes=8)
    dp_kv = DPPolicy(cube="row_wise", pe="replicate", num_cubes=n, num_pes=8)
    q = ctx.zeros((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
    k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
    v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
    o = ctx.empty((S_Q_PREFILL, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
    ctx.launch(
        "multi_user_prefill_mesh", attention_mesh_kv_kernel,
        q, k, v, o,
        S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
        1,  # rank_axis=1 → ring at cube level (ADR-0059 multi_user)
        _auto_dim_remap=False,
    )
 def _bench_fn_multi_user_decode(ctx):
    configure_sfr_intercube_multisip(
        ctx.engine, ctx.spec,
        resolve_algorithm_config(load_ccl_config(), name="lrab_hierarchical_allreduce"),
    )
    n = N_RANKS_MULTI_USER
    dp_full = DPPolicy(cube="replicate", pe="replicate", num_cubes=n, num_pes=8)
    dp_kv = DPPolicy(cube="row_wise", pe="replicate", num_cubes=n, num_pes=8)
    q = ctx.zeros((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="q")
    k = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="k")
    v = ctx.zeros((S_KV_PER_RANK * n, H_KV * D_HEAD), dtype=DTYPE, dp=dp_kv, name="v")
    o = ctx.empty((S_Q_DECODE, H_Q * D_HEAD), dtype=DTYPE, dp=dp_full, name="o")
    ctx.launch(
        "multi_user_decode_mesh", attention_mesh_mlo_kernel,
        q, k, v, o,
        S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, n,
        1,  # rank_axis=1 → ring at cube level (ADR-0059 multi_user)
        _auto_dim_remap=False,
    )
 # ── Tests ────────────────────────────────────────────────────────
 def test_single_user_prefill_through_engine():
    exc, result, engine = _run_panel(_bench_fn_single_user_prefill)
    _assert_ok("single_user_prefill", exc, result, engine)
 def test_single_user_decode_through_engine():
    exc, result, engine = _run_panel(_bench_fn_single_user_decode)
    _assert_ok("single_user_decode", exc, result, engine)
 def test_multi_user_prefill_through_engine():
    exc, result, engine = _run_panel(_bench_fn_multi_user_prefill)
    _assert_ok("multi_user_prefill", exc, result, engine)
 def test_multi_user_decode_through_engine():
    exc, result, engine = _run_panel(_bench_fn_multi_user_decode)
    _assert_ok("multi_user_decode", exc, result, engine)
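The multi_user panels above pass `rank_axis=1` positionally. The rank selection this implies can be sketched in plain Python, with `pe_id` and `cube_id` standing in for `tl.program_id(axis=0)` and `tl.program_id(axis=1)`. `resolve_rank` and `send_directions` are hypothetical names used only for illustration. The east-edge rule (last rank has no E neighbor) is an assumption made by symmetry with the west-edge behavior described in the commit.

```python
from __future__ import annotations


def resolve_rank(pe_id: int, cube_id: int, rank_axis: int = 0) -> int | None:
    """Return the ring rank for this program, or None if it must stay silent."""
    if rank_axis == 0:
        # single_user: ring across PEs in one cube, every PE participates.
        return pe_id
    if pe_id != 0:
        # multi_user: only PE 0 of each cube has E/W neighbors installed.
        return None
    return cube_id


def send_directions(rank: int, n_ranks: int) -> list[str]:
    """Directions a rank sends in on a non-wrapping chain (assumed layout)."""
    dirs = []
    if rank < n_ranks - 1:
        dirs.append("E")
    if rank > 0:
        dirs.append("W")
    return dirs


# One cube (cube_id=1) of a 4-cube ring with 8 PEs per cube.
ranks_in_cube = [resolve_rank(pe, 1, rank_axis=1) for pe in range(8)]
interior = send_directions(1, 4)
west_edge = send_directions(0, 4)
```

In this sketch only the first entry of `ranks_in_cube` is a rank and the other seven are `None`, which corresponds to the 7-of-8 early-return gate.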
@@ -0,0 +1,172 @@
 """Phase 1 spec test for ``rank_axis`` parameter on the two mesh kernels.
 ADR-0059's mesh kernels currently hard-code ``rank = tl.program_id(axis=0)``,
 which only works for single_user_* panels (rank == pe_id within cube).
 For multi_user_* panels the ring is at the cube level — rank should be
 ``cube_id`` (axis=1), and the 7 non-rank-leader PEs in each cube should
 not run the ring (they only hold KV replicas).
 This test pins the desired ``rank_axis`` kwarg semantics:
  rank_axis = 0 (default, single_user)
    rank = tl.program_id(axis=0). Every PE in the cube runs the ring.
    Existing behavior — no change.
  rank_axis = 1 (multi_user)
    if tl.program_id(axis=0) != 0: return.  (7/8 PEs early-exit.)
    rank = tl.program_id(axis=1).
 Phase 1 expectation: tests fail today (kernels don't accept the kwarg).
 Phase 2 lands the parameter on both kernels; tests turn green and the
 multi_user_* diag harness clears its first send.
 """
 from __future__ import annotations
 from kernbench.common.ipcq_types import IpcqRecvCmd, IpcqSendCmd
 from kernbench.common.pe_commands import GemmCmd
 from kernbench.triton_emu.tl_context import TLContext, run_kernel
 from kernbench.benches._attention_mesh_kv import attention_mesh_kv_kernel
 from kernbench.benches._attention_mesh_mlo import attention_mesh_mlo_kernel
 S_Q_PREFILL = 16
 S_Q_DECODE = 1
 S_KV_PER_RANK = 16
 H_Q = 1
 H_KV = 1
 D_HEAD = 64
 N_RANKS_MULTI = 4
 PES_PER_CUBE = 8
 Q_PTR = 0x10000
 K_PTR = 0x20000
 V_PTR = 0x30000
 O_PTR = 0x40000
 def _tl(pe_id: int, cube_id: int, num_pes: int, num_cubes: int) -> TLContext:
    return TLContext(
        pe_id=pe_id,
        num_programs=num_pes,
        cube_id=cube_id,
        num_cubes=num_cubes,
        dispatch_cycles=0,
        scratch_base=0x80000,
        scratch_size=1 << 20,
    )
 # ── Default rank_axis=0 backward-compat ──────────────────────────
 def test_mlo_kernel_default_rank_axis_zero_emits_commands_on_all_pes():
    """rank_axis defaults to 0 → kernel uses pe_id as rank, runs on every
    PE. Verify by running rank=3 (interior PE) in a single-cube 8-rank
    setup and asserting at least one GEMM and at least one IPCQ send
    are emitted (interior ranks send in both directions)."""
    tl = _tl(pe_id=3, cube_id=0, num_pes=8, num_cubes=1)
    run_kernel(
        attention_mesh_mlo_kernel, tl,
        Q_PTR, K_PTR, V_PTR, O_PTR,
        S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, 8,
    )
    assert any(isinstance(c, GemmCmd) for c in tl.commands), \
        "default rank_axis=0 must run the kernel (≥1 GEMM)"
    assert any(isinstance(c, IpcqSendCmd) for c in tl.commands), \
        "interior rank must emit ≥1 IpcqSendCmd"
 def test_kv_kernel_default_rank_axis_zero_emits_commands_on_all_pes():
    tl = _tl(pe_id=3, cube_id=0, num_pes=8, num_cubes=1)
    run_kernel(
        attention_mesh_kv_kernel, tl,
        Q_PTR, K_PTR, V_PTR, O_PTR,
        S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, 8,
    )
    assert any(isinstance(c, GemmCmd) for c in tl.commands)
    assert any(isinstance(c, IpcqSendCmd) for c in tl.commands)
 # ── rank_axis=1 multi_user semantics ─────────────────────────────
 def test_mlo_kernel_rank_axis_one_gates_non_zero_pe_to_no_commands():
    """rank_axis=1 + pe_id != 0 → kernel must early-return; no GEMM,
    no DMA, no IPCQ. The 7 non-rank-leader PEs in a multi_user cube
    must stay completely silent so the cube-level SFR install isn't
    asked to route sends from PEs that have no neighbors installed."""
    tl = _tl(pe_id=2, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
    run_kernel(
        attention_mesh_mlo_kernel, tl,
        Q_PTR, K_PTR, V_PTR, O_PTR,
        S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
        rank_axis=1,
    )
    assert not any(isinstance(c, GemmCmd) for c in tl.commands), \
        "pe_id=2 with rank_axis=1 must not emit GEMMs"
    assert not any(isinstance(c, IpcqSendCmd) for c in tl.commands), \
        "pe_id=2 with rank_axis=1 must not emit IpcqSendCmd"
    assert not any(isinstance(c, IpcqRecvCmd) for c in tl.commands), \
        "pe_id=2 with rank_axis=1 must not emit IpcqRecvCmd"
 def test_kv_kernel_rank_axis_one_gates_non_zero_pe_to_no_commands():
    tl = _tl(pe_id=2, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
    run_kernel(
        attention_mesh_kv_kernel, tl,
        Q_PTR, K_PTR, V_PTR, O_PTR,
        S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
        rank_axis=1,
    )
    assert not any(isinstance(c, GemmCmd) for c in tl.commands)
    assert not any(isinstance(c, IpcqSendCmd) for c in tl.commands)
    assert not any(isinstance(c, IpcqRecvCmd) for c in tl.commands)
 def test_mlo_kernel_rank_axis_one_pe_zero_uses_cube_id_as_rank():
    """rank_axis=1 + pe_id == 0 → kernel runs the ring with rank=cube_id.
    For cube_id=1 in a 4-cube ring, rank=1 is an interior rank: has_E=True
    AND has_W=True → IPCQ sends emitted in both E and W directions.
    """
    tl = _tl(pe_id=0, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
    run_kernel(
        attention_mesh_mlo_kernel, tl,
        Q_PTR, K_PTR, V_PTR, O_PTR,
        S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
        rank_axis=1,
    )
    sends = [c for c in tl.commands if isinstance(c, IpcqSendCmd)]
    assert any(s.direction == "E" for s in sends), \
        "cube_id=1 (interior) must emit ≥1 E-send"
    assert any(s.direction == "W" for s in sends), \
        "cube_id=1 (interior) must emit ≥1 W-send"
 def test_kv_kernel_rank_axis_one_pe_zero_uses_cube_id_as_rank():
    tl = _tl(pe_id=0, cube_id=1, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
    run_kernel(
        attention_mesh_kv_kernel, tl,
        Q_PTR, K_PTR, V_PTR, O_PTR,
        S_Q_PREFILL, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
        rank_axis=1,
    )
    sends = [c for c in tl.commands if isinstance(c, IpcqSendCmd)]
    assert any(s.direction == "E" for s in sends)
    assert any(s.direction == "W" for s in sends)
 def test_mlo_kernel_rank_axis_one_west_edge_cube_no_west_sends():
    """cube_id=0 (west edge) with rank_axis=1: rank=0, has_W=False → no
    W-direction IPCQ sends. has_E=True → ≥1 E-direction send."""
    tl = _tl(pe_id=0, cube_id=0, num_pes=PES_PER_CUBE, num_cubes=N_RANKS_MULTI)
    run_kernel(
        attention_mesh_mlo_kernel, tl,
        Q_PTR, K_PTR, V_PTR, O_PTR,
        S_Q_DECODE, S_KV_PER_RANK, H_Q, H_KV, D_HEAD, N_RANKS_MULTI,
        rank_axis=1,
    )
    sends = [c for c in tl.commands if isinstance(c, IpcqSendCmd)]
    assert any(s.direction == "E" for s in sends), \
        "west-edge cube_id=0 must still emit ≥1 E-send"
    assert not any(s.direction == "W" for s in sends), \
        "west-edge cube_id=0 must NOT emit any W-send (no W neighbor)"
@@ -1,283 +1,31 @@
-"""Shared plotting plumbing for the GEMM figure tests.
+"""Thin re-export shim for the GEMM figure tests.
-Not a test module (no ``test_`` prefix -> pytest does not collect it).
+Not a test module (no ``test_`` prefix → pytest does not collect it).
-Reads the committed ``docs/diagrams/gemm_sweep.json`` (produced by the heavy
-``scripts/gemm_sweep.py`` sim sweep) and renders matplotlib PNGs into
-``docs/diagrams/gemm_plots/``. No simulation here -> the figure tests are fast
-and run by default; regenerating the underlying data stays a manual script.
-
+The sweep + renderer logic now lives in
+``kernbench.benches.milestone_1h_gemm`` (production single home, ADR-0054,
+also driven by ``scripts/gemm_sweep.py``). The figure tests import the same
+names from here; behavior is unchanged (defaults still target
+``docs/diagrams/gemm_plots/``).
 Chart set (mirrors the GEMM MAC slides in scripts/build_overview_slides.py):
  - stage breakdown (load_ref operand staging)
  - MAC utilization — measured (load_ref)
  - MAC utilization — theoretical vs measured (load_ref)
 """
 from __future__ import annotations
-import json
-from pathlib import Path
+from kernbench.benches.milestone_1h_gemm import (
+    DEFAULT_PLOTS_DIR as GEMM_PLOTS_DIR,
+    DEFAULT_SWEEP_JSON as GEMM_SWEEP_JSON,
+    ROOT,
+    emit_all_gemm_plots,
+    emit_mac_utilization_measured,
+    emit_mac_utilization_theoretical_vs_measured,
+    emit_stage_breakdown,
+)
-ROOT = Path(__file__).resolve().parent.parent.parent
-GEMM_SWEEP_JSON = ROOT / "docs" / "diagrams" / "gemm_sweep.json"
-GEMM_PLOTS_DIR = ROOT / "docs" / "diagrams" / "gemm_plots"
-
-# Shapes excluded from the figures (mirrors build_overview_slides).
-EXCLUDED_SHAPES = {(512, 512, 512)}
-
-# Stage bars shown (raw op_log stage_type keys) + display names + colors.
-STAGE_KEYS = ["DMA_READ", "FETCH", "GEMM", "DMA_WRITE"]
+__all__ = [
+    "GEMM_PLOTS_DIR",
+    "GEMM_SWEEP_JSON",
+    "ROOT",
+    "emit_all_gemm_plots",
+    "emit_mac_utilization_measured",
+    "emit_mac_utilization_theoretical_vs_measured",
+    "emit_stage_breakdown",
+]
 STAGE_DISPLAY = {
    "DMA_READ":  "DMA in",
    "FETCH":     "Fetch",
    "GEMM":      "GEMM",
    "DMA_WRITE": "DMA out",
 }
 STAGE_COLORS = {
    "DMA_READ":  "#3B82F6",
    "FETCH":     "#10B981",
    "GEMM":      "#F59E0B",
    "DMA_WRITE": "#A855F7",
 }
 # MAC-utilization model constants (mirror build_overview_slides).
 _HBM_GBS = 256.0
 _BPE = 2
 _T_STAGE = 16.0
 _D_STAGES = 3
 _PLOT_VARIANT = "load_ref"
 def _load_sweep_data() -> dict:
    if not GEMM_SWEEP_JSON.exists():
        return {"rows": []}
    data = json.loads(GEMM_SWEEP_JSON.read_text())
    data["rows"] = [
        r for r in data.get("rows", [])
        if (r["M"], r["K"], r["N"]) not in EXCLUDED_SHAPES
    ]
    return data
 def _shape_label(r: dict) -> str:
    if r["M"] == r["K"] == r["N"]:
        return f"M=K=N={r['M']}"
    return f"M={r['M']} K={r['K']} N={r['N']}"
 def _under_tile(M, K, N, tile_M, tile_K, tile_N) -> bool:
    return M < tile_M or K < tile_K or N < tile_N
 def _xtick_labels(shape_labels, tile_counts, flagged) -> list[str]:
    out = []
    for lbl, tc, fl in zip(shape_labels, tile_counts, flagged):
        s = f"{lbl}\n({tc} tiles)"
        if fl:
            s += " *"
        out.append(s)
    return out
 def _grouped_bar_png(
    out_name: str, *, title: str, subtitle: str | None,
    shape_labels, tile_counts, flagged, series: dict, colors: dict,
    y_label: str, threshold: float | None = None, footnote: str | None = None,
 ) -> str:
    """Render one grouped-bar chart to GEMM_PLOTS_DIR/out_name; return the path."""
    import matplotlib.pyplot as plt
    import numpy as np
    n_groups = len(shape_labels)
    n_series = max(1, len(series))
    x = np.arange(n_groups)
    width = 0.8 / n_series
    fig, ax = plt.subplots(figsize=(11, 6))
    for i, (name, vals) in enumerate(series.items()):
        offset = (i - (n_series - 1) / 2) * width
        ax.bar(x + offset, vals, width, label=name, color=colors.get(name))
    ax.set_xticks(x)
    ax.set_xticklabels(
        _xtick_labels(shape_labels, tile_counts, flagged), fontsize=8,
    )
    ax.set_ylabel(y_label)
    ax.set_title(title, fontsize=13, fontweight="bold")
    if subtitle:
        ax.text(0.5, 1.01, subtitle, transform=ax.transAxes, ha="center",
                va="bottom", fontsize=8, color="#475569")
    if threshold is not None:
        ax.axhline(threshold, ls="--", color="gray", lw=1.0)
    ax.legend(fontsize=8, loc="upper right")
    ax.grid(True, axis="y", alpha=0.3)
    caption = "* = under-tile shape (M<TILE_M, K<TILE_K, or N<TILE_N)"
    if footnote:
        caption = footnote + "\n" + caption
    fig.text(0.5, 0.01, caption, ha="center", fontsize=7, color="gray",
             wrap=True)
    fig.tight_layout(rect=(0, 0.05, 1, 1))
    GEMM_PLOTS_DIR.mkdir(parents=True, exist_ok=True)
    out = GEMM_PLOTS_DIR / out_name
    fig.savefig(out, dpi=120)
    plt.close(fig)
    return str(out)
 # ── individual chart renderers (read sweep JSON, emit one PNG each) ─────
 def emit_stage_breakdown() -> str | None:
    """Per-stage engine wall-clock per shape (load_ref operand staging)."""
    data = _load_sweep_data()
    rows = [r for r in data["rows"] if r.get("variant") == _PLOT_VARIANT]
    if not rows:
        return None
    tile = data["tile_sizes"]
    shape_labels = [_shape_label(r) for r in rows]
    flagged = [_under_tile(r["M"], r["K"], r["N"], tile["M"], tile["K"], tile["N"])
               for r in rows]
    tile_counts = [r["tile_count_expected"] for r in rows]
    series = {
        STAGE_DISPLAY[s]: [r.get("stages", {}).get(s, {}).get("wall_ns", 0.0)
                           for r in rows]
        for s in STAGE_KEYS
    }
    colors = {STAGE_DISPLAY[s]: STAGE_COLORS[s] for s in STAGE_KEYS}
    return _grouped_bar_png(
        "gemm_stage_breakdown.png",
        title="GEMM stage breakdown",
        subtitle=(f"Per-stage engine wall-clock (DMA in / Fetch / GEMM / "
                  f"DMA out), {_PLOT_VARIANT} staging. "
                  f"Tile {tile['M']}x{tile['K']}x{tile['N']}."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="ns",
        footnote="Bars = engine wall-clock interval (merged overlaps).",
    )
 def emit_mac_utilization_measured() -> str | None:
    """GEMM util % and useful pipeline-eff % (analytical model, load_ref)."""
    data = _load_sweep_data()
    rows = data["rows"]
    if not rows:
        return None
    tile = data["tile_sizes"]
    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
    head_ns = (_D_STAGES - 1) * _T_STAGE
    by_shape = {(r["M"], r["K"], r["N"]): r
                for r in rows if r["variant"] == _PLOT_VARIANT}
    shapes = list(by_shape)
    if not shapes:
        return None
    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
    gemm_util, useful_eff = [], []
    for k in shapes:
        r = by_shape[k]
        M, K, N = r["M"], r["K"], r["N"]
        useful = 2 * M * K * N
        tiles = r["tile_count_expected"]
        gu = useful / (tile_flops * tiles) * 100
        gemm_util.append(gu)
        m_tiles = (M + TILE_M - 1) // TILE_M
        n_tiles = (N + TILE_N - 1) // TILE_N
        n_mn = m_tiles * n_tiles
        compute_total = tiles * _T_STAGE
        wall = head_ns + tiles * _T_STAGE + max(0, n_mn - 1) * dma_w_per_pair
        ueff = (compute_total * (gu / 100.0) / wall) * 100 if wall > 0 else 0.0
        useful_eff.append(ueff)
    series = {"GEMM util %": gemm_util, "Useful eff %": useful_eff}
    colors = {"GEMM util %": "#10B981", "Useful eff %": "#F59E0B"}
    return _grouped_bar_png(
        "gemm_mac_utilization_measured.png",
        title="GEMM MAC utilization — load_ref",
        subtitle=("GEMM util = useful FLOPs / (tile FLOPs x tiles); "
                  "Useful eff = GEMM util x ideal pipeline efficiency."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="%", threshold=100.0,
        footnote="Theoretical ideal-pipeline model (not simulator data).",
    )
 def emit_mac_utilization_theoretical_vs_measured() -> str | None:
    """Theoretical vs simulator-measured GEMM util / useful eff (load_ref)."""
    data = _load_sweep_data()
    rows = data["rows"]
    if not rows:
        return None
    tile = data["tile_sizes"]
    TILE_M, TILE_K, TILE_N = tile["M"], tile["K"], tile["N"]
    tile_flops = 2 * TILE_M * TILE_K * TILE_N
    dma_w_per_pair = (TILE_M * TILE_N * _BPE) / _HBM_GBS
    head_ns = (_D_STAGES - 1) * _T_STAGE
    peak_per_ns = tile_flops / _T_STAGE
    by_shape = {(r["M"], r["K"], r["N"]): r
                for r in rows if r["variant"] == _PLOT_VARIANT}
    shapes = list(by_shape)
    if not shapes:
        return None
    shape_labels = [_shape_label(by_shape[k]) for k in shapes]
    flagged = [_under_tile(*k, TILE_M, TILE_K, TILE_N) for k in shapes]
    tile_counts = [by_shape[k]["tile_count_expected"] for k in shapes]
    gu_t, gu_m, eff_t, eff_m = [], [], [], []
    for k in shapes:
        r = by_shape[k]
        M, K, N = r["M"], r["K"], r["N"]
        useful = 2 * M * K * N
        tiles = r["tile_count_expected"]
        gut = useful / (tile_flops * tiles)
        gu_t.append(gut * 100)
        rec = r.get("stages", {}).get("GEMM", {}).get("record_count", 0) or tiles
        gu_m.append((useful / (tile_flops * rec) * 100) if rec else 0.0)
        m_tiles = (M + TILE_M - 1) // TILE_M
        n_tiles = (N + TILE_N - 1) // TILE_N
        n_mn = m_tiles * n_tiles
        compute_total = tiles * _T_STAGE
        wall_t = head_ns + compute_total + max(0, n_mn - 1) * dma_w_per_pair
        eff_t.append((compute_total * gut / wall_t * 100) if wall_t > 0 else 0.0)
        cw = r.get("composite_window_ns", 0.0) or 0.0
        eff_m.append((useful / cw / peak_per_ns * 100) if cw > 0 else 0.0)
    series = {
        "GEMM util % (theoretical)": gu_t,
        "GEMM util % (measured)":    gu_m,
        "Theoretical eff %":         eff_t,
        "Measured eff %":            eff_m,
    }
    colors = {
        "GEMM util % (theoretical)": "#10B981",
        "GEMM util % (measured)":    "#6EE7B7",
        "Theoretical eff %":         "#F59E0B",
        "Measured eff %":            "#3B82F6",
    }
    return _grouped_bar_png(
        "gemm_mac_utilization_theoretical_vs_measured.png",
        title="GEMM MAC utilization — theoretical vs measured (load_ref)",
        subtitle=("theoretical model vs simulator op_log; agreement "
                  "validates the analytical pipeline model."),
        shape_labels=shape_labels, tile_counts=tile_counts, flagged=flagged,
        series=series, colors=colors, y_label="%", threshold=100.0,
    )
 def emit_all_gemm_plots() -> list[str]:
    """Render every GEMM figure that has data; return the list of paths written."""
    paths = []
    for fn in (emit_stage_breakdown,
               emit_mac_utilization_measured,
               emit_mac_utilization_theoretical_vs_measured):
        p = fn()
        if p:
            paths.append(p)
    return paths
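Reviewer note: both utilization emitters above use the same analytical pipeline model. The sketch below restates it as one standalone function so the arithmetic can be checked by hand. It is a minimal sketch under stated assumptions, not the module's code: the constants are hypothetical stand-ins for `_T_STAGE`, `_D_STAGES`, `_BPE` and `_HBM_GBS` (their real values are not shown in this diff), and the tile count is derived with ceiling division, whereas the source reads it from `tile_count_expected` in the sweep row.

```python
from math import ceil

# Hypothetical constants for illustration only; the real module-level
# values (_T_STAGE, _D_STAGES, _BPE, _HBM_GBS) are not shown in this diff.
T_STAGE_NS = 100.0   # time per pipeline stage, in ns
D_STAGES = 4         # pipeline depth
BPE = 2              # bytes per element (f16)
HBM_GBS = 256.0      # HBM bandwidth in GB/s, numerically equal to bytes/ns


def pipeline_model(M, K, N, tile_m, tile_k, tile_n):
    """Return (gemm_util_pct, useful_eff_pct) for one GEMM shape."""
    m_tiles = ceil(M / tile_m)
    k_tiles = ceil(K / tile_k)
    n_tiles = ceil(N / tile_n)
    # Assumption: every (m, k, n) tile is issued exactly once. The source
    # takes this count from the sweep row instead of recomputing it.
    tiles = m_tiles * k_tiles * n_tiles

    useful_flops = 2 * M * K * N
    tile_flops = 2 * tile_m * tile_k * tile_n
    gemm_util = useful_flops / (tile_flops * tiles)

    head_ns = (D_STAGES - 1) * T_STAGE_NS          # pipeline fill
    compute_ns = tiles * T_STAGE_NS
    dma_w_ns = (tile_m * tile_n * BPE) / HBM_GBS   # per-(m, n) DMA term
    n_mn = m_tiles * n_tiles
    wall_ns = head_ns + compute_ns + max(0, n_mn - 1) * dma_w_ns

    useful_eff = compute_ns * gemm_util / wall_ns if wall_ns > 0 else 0.0
    return gemm_util * 100, useful_eff * 100
```

With these stand-in constants, a shape that exactly fills one 64x64x64 tile gives 100 % GEMM util but only 25 % useful efficiency, because the three-stage pipeline fill dominates a single compute stage. Halving M halves both numbers.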
@@ -0,0 +1,131 @@
 """Phase 1 spec test for ``ctx.launch`` dim-translation bugs surfaced by
 the multi_user_* panels of milestone-gqa-llama70b (sub-cycle 4c step 2).
 The default ``topology.yaml`` has 4×4 = 16 cubes per SIP, so
 ``RuntimeContext._num_cubes == 16``. Multi-user attention panels run a
 4-cube ring (validation scale) by passing ``DPPolicy(num_cubes=4)``.
 Two bugs in ``ctx.launch`` make this combination silently produce wrong
 kernel arguments:
 Bug A — _compute_local_shape ignores DPPolicy.num_cubes
    ``_compute_local_shape`` in ``ctx.launch`` divides by
    ``self._num_cubes`` (the topology's cube count, 16) instead of the
    DPPolicy's effective ``num_cubes`` (4). So a ``(M=80, K=64)`` tensor
    sharded ``cube="row_wise"`` with ``DPPolicy(num_cubes=4)`` produces
    a local M of ``80 // 16 = 5``, not the kernel-expected ``80 // 4 = 20``.
    Note: tensor allocation already honors ``dp.num_cubes`` correctly at
    [context.py:471-484](src/kernbench/runtime_api/context.py#L471-L484);
    the bug is that the parallel computation inside ``launch`` is out of sync.
 Bug B — scalar args coincidentally equal to a global tensor dim get auto-remapped
    The dim_map at [context.py:712-770](src/kernbench/runtime_api/context.py#L712-L770)
    is keyed by *value*, so any scalar whose value coincides with a
    global tensor dim gets rewritten to that dim's local value — even
    when the scalar is unrelated. ``d_head=64`` coincides with the
    multi_user K's global M = ``S_kv_per_rank * n = 16 * 4 = 64``, so
    the kernel receives ``d_head = 16`` (the post-Bug-A local) or
    ``d_head = 4`` (the pre-Bug-A local) instead of ``64``.
    Legacy bench kernels rely on auto-remap (e.g. ``test_va_offset.py``
    passes global N and expects the kernel to see local N). The fix is
    opt-out, not removal: ``ctx.launch(..., _auto_dim_remap=False)``
    preserves scalars exactly as passed, default behavior unchanged.
 Both bug tests (A and B) fail today. Phase 2 fixes them in [src/kernbench/runtime_api/context.py](src/kernbench/runtime_api/context.py).
 """
 from __future__ import annotations
 from pathlib import Path
 from kernbench.policy.placement.dp import DPPolicy
 from kernbench.runtime_api.context import RuntimeContext
 from kernbench.runtime_api.types import DeviceSelector
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import load_topology
 TOPOLOGY_PATH = Path(__file__).parent.parent / "topology.yaml"
 def _make_ctx(corr_id: str) -> RuntimeContext:
    graph = load_topology(TOPOLOGY_PATH)
    engine = GraphEngine(graph)
    return RuntimeContext(
        engine=engine, target_device=DeviceSelector("sip:0"),
        correlation_id=corr_id, spec=graph.spec,
    )
 def test_topology_num_cubes_is_16_baseline_assumption():
    """Sanity: confirm the topology this test assumes (16 cubes per SIP).
    If this fails, recheck the topology.yaml cube_mesh setting before
    interpreting the other failures below. ``_num_cubes`` is initialized
    lazily by ``_ensure_allocators`` on first tensor op, so trigger it."""
    ctx = _make_ctx("dim-baseline")
    ctx._ensure_allocators()
    assert ctx._num_cubes == 16, (
        f"expected default topology.yaml to give 16 cubes per SIP, "
        f"got {ctx._num_cubes}"
    )
 def test_ctx_launch_local_shape_honors_dppolicy_num_cubes():
    """Bug A. ``DPPolicy(num_cubes=4)`` must be the divisor for
    row_wise sharding inside ctx.launch's dim_map, not the topology's 16.
    Setup: K-like tensor with M_global = 80 (cleanly divisible by both
    4 and 16, distinct local values 20 vs 5). Pass M_global as a kernel
    scalar; the kernel records what it received. With correct dim_map,
    scalar 80 is remapped to 20 (80 / dp.num_cubes). With current code,
    it is remapped to 5 (80 / self._num_cubes, with self._num_cubes == 16).
    """
    captured: dict[str, int] = {}
    def _kernel(t, m_scalar, *, tl):  # noqa: ARG001
        captured["m_scalar"] = int(m_scalar)
    ctx = _make_ctx("dim-bugA")
    dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=4, num_pes=8)
    t = ctx.zeros((80, 64), dtype="f16", dp=dp, name="t80x64")
    ctx.launch("bugA_capture", _kernel, t, 80)
    ctx.wait_all()
    assert "m_scalar" in captured, "kernel was not invoked"
    assert captured["m_scalar"] == 20, (
        f"expected dim_map to divide 80 by dp.num_cubes=4 → 20; "
        f"got {captured['m_scalar']} (likely divided by topology cubes=16)"
    )
 def test_ctx_launch_scalar_passed_through_when_auto_remap_disabled():
    """Bug B. Scalars must not be silently remapped when their value
    happens to equal a tensor's global dim — at minimum the caller must
    have an opt-out.
    Setup: K-like tensor with M_global = 64 row_wise. Pass d_head = 64
    as a scalar (semantically unrelated to K's M, but coincidentally
    equal). The kernel records d_head. With ``_auto_dim_remap=False``
    on ctx.launch, d_head must stay 64.
    Today: ``_auto_dim_remap`` kwarg doesn't exist → TypeError. After
    Phase 2: kwarg exists, defaults to True (legacy unchanged); passing
    False preserves the scalar.
    """
    captured: dict[str, int] = {}
    def _kernel(t, d_head, *, tl):  # noqa: ARG001
        captured["d_head"] = int(d_head)
    ctx = _make_ctx("dim-bugB")
    dp = DPPolicy(cube="row_wise", pe="replicate", num_cubes=4, num_pes=8)
    t = ctx.zeros((64, 64), dtype="f16", dp=dp, name="t64x64")
    ctx.launch(
        "bugB_capture", _kernel, t, 64,
        _auto_dim_remap=False,
    )
    ctx.wait_all()
    assert captured.get("d_head") == 64, (
        f"expected d_head scalar to pass through unchanged when "
        f"_auto_dim_remap=False; got {captured.get('d_head')!r}"
    )
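Reviewer note: the two bugs described in the module docstring can be reproduced without the runtime. The sketch below is a toy model of a value-keyed remap, assuming row-wise sharding on dimension 0. `local_shape` and `remap_scalars` are hypothetical helpers written for illustration; they are not the `ctx.launch` implementation, and only the `auto_dim_remap` flag mirrors the `_auto_dim_remap` kwarg named in the tests.

```python
def local_shape(global_shape, num_cubes):
    """Row-wise sharding: split dimension 0 evenly across cubes."""
    m, *rest = global_shape
    return (m // num_cubes, *rest)


def remap_scalars(scalars, global_shape, num_cubes, auto_dim_remap=True):
    """Hypothetical value-keyed remap: global dim value -> local dim value."""
    if not auto_dim_remap:
        # Opt-out: scalars reach the kernel exactly as passed.
        return list(scalars)
    local = local_shape(global_shape, num_cubes)
    # Keyed by *value*, so any scalar equal to a sharded global dim is
    # rewritten, whether or not it has anything to do with that dim.
    dim_map = {g: l for g, l in zip(global_shape, local) if g != l}
    return [dim_map.get(s, s) for s in scalars]


# Bug A: dividing by the topology's 16 cubes instead of DPPolicy's 4.
wrong_m = remap_scalars([80], (80, 64), num_cubes=16)
right_m = remap_scalars([80], (80, 64), num_cubes=4)

# Bug B: d_head=64 coincides with the tensor's global M=64.
clobbered = remap_scalars([64], (64, 64), num_cubes=4)
preserved = remap_scalars([64], (64, 64), num_cubes=4, auto_dim_remap=False)
```

Keeping the remap on by default and adding an opt-out matches the docstring's reasoning: legacy kernels that pass a global N and expect a local N keep working, while callers with unrelated scalars can switch the remap off.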
@@ -0,0 +1,77 @@
 """Milestone benches: registration + figure/result generation (ADR-0054).
 ``milestone-1h-gemm`` / ``milestone-1h-ccl`` are eval benches: run via the
 normal ``run_bench`` path, they regenerate every GEMM / allreduce figure +
 CSV into ``benches/1H_milestone_output/{gemm,ccl}/``. The GEMM bench in
 ``MILESTONE_FAST=1`` mode just re-renders the committed sweep JSON (fast,
 default-run here); the CCL bench drives both full sweeps (slow, opt-in).
 """
 from __future__ import annotations
 import re
 from pathlib import Path
 import pytest
 from kernbench.benches.registry import resolve
 from kernbench.runtime_api.bench_runner import run_bench
 from kernbench.runtime_api.types import resolve_device
 from kernbench.sim_engine.engine import GraphEngine
 from kernbench.topology.builder import resolve_topology
 import kernbench.benches.milestone_1h_ccl as ccl_bench
 import kernbench.benches.milestone_1h_gemm as gemm_bench
 _NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")
 def _run(name: str):
    topo = resolve_topology("topology.yaml")
    return run_bench(
        topology=topo, bench_fn=resolve(name).run, device=resolve_device(None),
        engine_factory=lambda t, d: GraphEngine(
            getattr(t, "topology_obj", t), enable_data=True,
        ),
    )
 def test_milestone_benches_registered():
    for name in ("milestone-1h-gemm", "milestone-1h-ccl"):
        spec = resolve(name)
        assert spec.name == name
        assert _NAME_RE.match(spec.name)
        assert spec.description.strip()
@pytest.mark.skipif(
    not gemm_bench.DEFAULT_SWEEP_JSON.exists(),
    reason="gemm_sweep.json absent; run scripts/gemm_sweep.py first",
 )
 def test_milestone_gemm_fast_generates_figures(monkeypatch):
    monkeypatch.setenv("MILESTONE_FAST", "1")
    result = _run("milestone-1h-gemm")
    assert result.completion.ok, result.completion
    out = gemm_bench._OUTPUT_DIR
    for png in (
        "gemm_stage_breakdown.png",
        "gemm_mac_utilization_measured.png",
        "gemm_mac_utilization_theoretical_vs_measured.png",
    ):
        assert (out / png).exists(), f"missing {png}"
@pytest.mark.slow
 def test_milestone_ccl_generates_figures():
    result = _run("milestone-1h-ccl")
    assert result.completion.ok, result.completion
    out = ccl_bench._OUTPUT_DIR
    for artifact in (
        "summary.csv",
        "topology.png",
        "comparison_mesh_vs_ring_vs_2DTorus_vs_theoretical_vs_fsim.png",
        "AllReduce_LRAB_2Dtorus_6SiP_2x3_with_TCM_SRAM_HBM.png",
        "AllReduce_LRAB_Ring1D_6SiP_1x6.png",
        "AllReduce_LRAB_2Dtorus_6SiP_2x3.png",
        "AllReduce_LRAB_2DMesh_6SiP_2x3.png",
    ):
        assert (out / artifact).exists(), f"missing {artifact}"
@@ -0,0 +1,218 @@
 """Phase 1 spec test for the math-input snapshot race (IPCQ slot wrap).
 Context (sub-cycle 4c.0 diagnostic):
  The mesh decode kernel (_attention_mesh_mlo.py) issues many tl.recv()
  calls against an IPCQ ring of ~8 slots. With n_ranks=8 and bidirectional
  fan-out, each PE issues 3 recvs per step × 7 steps × 2 directions =
  42 recvs per panel. The IPCQ slot index is ``my_tail % n_slots``, so
  the ring wraps and a fresh recv overwrites a slot whose data a prior
  math op had not yet snapshotted.
  OpLogger.record_end currently snapshots math inputs by re-reading
  MemoryStore at record_end time (op_log.py:97-113). When a later recv
  has overwritten the input addr with a DIFFERENT-shape array between
  record_start and record_end, MemoryStore.read raises
  ``Shape mismatch: stored (16, 64) vs requested (16, 1)`` and the
  snapshot becomes None (or, in Phase 2 replay, surfaces the same
  exception in DataExecutor).
 Phase 1 expectation: this test currently fails. It asserts the
 *desired* behavior: when the math input TensorHandle carries a
 .data snapshot (captured at recv time before the slot was wrapped),
 OpLogger MUST prefer that snapshot over MemoryStore.read.
 After Phase 2 (snapshot propagation fix), this test passes — and the
 sub-cycle 4c.0 mesh decode end-to-end (test_attention_mesh_decode_diag
 and test_milestone_gqa_llama70b) passes for the same reason.
 See: docs/adr/ADR-0020 (two-phase execution),
     docs/adr/ADR-0023 (IPCQ ring slots),
     docs/adr/ADR-0027 (snapshot discipline for dma_write).
 """
 from __future__ import annotations
 import numpy as np
 from kernbench.common.pe_commands import MathCmd, TensorHandle
 from kernbench.sim_engine.memory_store import MemoryStore
 from kernbench.sim_engine.op_log import OpLogger
 # ── Helpers ──────────────────────────────────────────────────────
 def _slot_handle(addr: int, shape: tuple[int, ...], dtype: str,
                 data: np.ndarray | None) -> TensorHandle:
    """Build a TensorHandle as tl.recv() would: addr=slot, .data=snapshot."""
    nbytes = int(np.prod(shape)) * np.dtype(
        {"f16": np.float16, "f32": np.float32}[dtype]
    ).itemsize
    return TensorHandle(
        id=f"slot_{addr:x}", addr=addr, shape=shape, dtype=dtype,
        nbytes=nbytes, data=data, space="tcm",
    )
 def _out_handle(addr: int, shape: tuple[int, ...], dtype: str) -> TensorHandle:
    nbytes = int(np.prod(shape)) * np.dtype(
        {"f16": np.float16, "f32": np.float32}[dtype]
    ).itemsize
    return TensorHandle(
        id=f"out_{addr:x}", addr=addr, shape=shape, dtype=dtype,
        nbytes=nbytes, data=None, space="tcm",
    )
 # ── Tests ─────────────────────────────────────────────────────────
 def test_math_snapshot_lost_when_input_slot_overwritten_with_same_nbytes():
    """Quiet variant (fails today, without raising): if a later write at the input addr has the
    SAME nbytes as the math input's expected shape, MemoryStore.read
    returns the LATER data — the snapshot is silently wrong. This is the
    quiet variant of the bug; it does not raise, it just produces
    incorrect numerical output in Phase 2.
    This test documents that the current OpLogger behavior is wrong even
    when shapes coincidentally match. The Phase 2 fix removes this
    silent-corruption mode by preferring handle.data.
    """
    store = MemoryStore()
    slot_addr = 0x3000
    # Original at recv time: filled with 7s.
    original = np.full((16, 1), 7.0, dtype=np.float16)
    store.write("tcm", slot_addr, original)
    inp = _slot_handle(slot_addr, (16, 1), "f16", data=original.copy())
    out = _out_handle(0x4000, (16, 1), "f16")
    cmd = MathCmd(op="maximum", inputs=(inp,), out=out)
    logger = OpLogger(memory_store=store)
    logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)
    # SIMULATE: a later recv writes a DIFFERENT array at the same slot
    # (same nbytes as (16,1), so MemoryStore.read does not raise).
    later = np.full((16, 1), 99.0, dtype=np.float16)
    store.write("tcm", slot_addr, later)
    logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)
    snap = logger.records[0].params["input_snapshots"][0]
    assert snap is not None
    # Desired post-fix behavior: snapshot equals ``original``.
    # Today: snapshot equals ``later`` — silent corruption.
    np.testing.assert_array_equal(snap, original)
 def test_math_snapshot_survives_input_slot_wrap_with_different_shape():
    """The hard-failure variant: a later recv overwrites the input slot
    with a DIFFERENT-shape array (different nbytes), so MemoryStore.read
    at record_end raises and the snapshot becomes None. Phase 2 replay
    then surfaces this as the (16, 64) vs (16, 1) crash seen in
    test_attention_mesh_decode_diag.
    Desired behavior: handle.data carries the recv-time snapshot, so
    OpLogger never has to look at MemoryStore for this input → no race,
    snapshot is correct.
    """
    store = MemoryStore()
    slot_addr = 0x3000
    # Original at recv time: an (m, ℓ) reduction result, shape (16, 1).
    original = np.full((16, 1), 7.0, dtype=np.float16)
    store.write("tcm", slot_addr, original)
    inp = _slot_handle(slot_addr, (16, 1), "f16", data=original.copy())
    out = _out_handle(0x4000, (16, 1), "f16")
    cmd = MathCmd(op="maximum", inputs=(inp,), out=out)
    logger = OpLogger(memory_store=store)
    logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)
    # SIMULATE the slot-wrap race: a later recv (an o triplet, shape
    # (16, 64)) writes the same TCM slot. MemoryStore.read for shape
    # (16, 1) now raises ValueError("Shape mismatch ...").
    overwrite = np.full((16, 64), 99.0, dtype=np.float16)
    store.write("tcm", slot_addr, overwrite)
    logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)
    snap = logger.records[0].params["input_snapshots"][0]
    # Today: snap is None (read raised, except branch returned None).
    # Post-fix: handle.data preferred → snap is original.
    assert snap is not None, (
        "input snapshot was lost when the recv slot was wrapped — "
        "OpLogger must prefer handle.data over MemoryStore.read for "
        "math inputs whose handle carries a .data snapshot"
    )
    assert snap.shape == (16, 1)
    np.testing.assert_array_equal(snap, original)
 def test_math_snapshot_handle_data_with_multiple_inputs():
    """maximum/binary math has 2 inputs; both must use their carried
    snapshots independently (e.g. m_running merged with m_from_W where
    only m_from_W came from a recv slot)."""
    store = MemoryStore()
    # Input 0: a running m value held in PE scratch (no .data; OpLogger
    # falls back to MemoryStore.read as today). Its addr is stable —
    # not subject to the slot-wrap race.
    scratch_addr = 0x5000
    m_running = np.full((16, 1), 3.0, dtype=np.float16)
    store.write("tcm", scratch_addr, m_running)
    inp0 = _slot_handle(scratch_addr, (16, 1), "f16", data=None)
    # Input 1: m_from_W via tl.recv — carries snapshot in .data, addr
    # is the recv slot which WILL be wrapped before record_end.
    slot_addr = 0x3000
    m_from_W = np.full((16, 1), 7.0, dtype=np.float16)
    store.write("tcm", slot_addr, m_from_W)
    inp1 = _slot_handle(slot_addr, (16, 1), "f16", data=m_from_W.copy())
    out = _out_handle(0x4000, (16, 1), "f16")
    cmd = MathCmd(op="maximum", inputs=(inp0, inp1), out=out)
    logger = OpLogger(memory_store=store)
    logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)
    # Slot 0x3000 gets wrapped by a later recv with a different shape.
    overwrite = np.full((16, 64), 99.0, dtype=np.float16)
    store.write("tcm", slot_addr, overwrite)
    logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)
    snaps = logger.records[0].params["input_snapshots"]
    assert len(snaps) == 2
    # Input 0 (no carried snapshot, addr stable): MemoryStore read still
    # works. This must keep working post-fix.
    assert snaps[0] is not None
    np.testing.assert_array_equal(snaps[0], m_running)
    # Input 1 (carried snapshot, slot wrapped): must come from .data.
    assert snaps[1] is not None
    assert snaps[1].shape == (16, 1)
    np.testing.assert_array_equal(snaps[1], m_from_W)
 def test_math_snapshot_falls_back_to_memory_store_when_handle_data_is_none():
    """Backward-compat: handles with .data=None must continue to use
    MemoryStore.read as today. Most math inputs (intermediate results
    from local tl.dot / tl.exp etc.) have data=None and their TCM addrs
    are stable for the kernel's lifetime."""
    store = MemoryStore()
    addr = 0x6000
    arr = np.full((8, 8), 2.0, dtype=np.float16)
    store.write("tcm", addr, arr)
    inp = _slot_handle(addr, (8, 8), "f16", data=None)
    out = _out_handle(0x7000, (8, 8), "f16")
    cmd = MathCmd(op="exp", inputs=(inp,), out=out)
    logger = OpLogger(memory_store=store)
    logger.record_start(10.0, "sip0.cube0.pe0.pe_math", cmd)
    logger.record_end(15.0, "sip0.cube0.pe0.pe_math", cmd)
    snap = logger.records[0].params["input_snapshots"][0]
    assert snap is not None
    np.testing.assert_array_equal(snap, arr)
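Reviewer note: the rule these tests pin down is "use the snapshot carried on the handle if there is one, otherwise read the store". The sketch below shows that rule against a toy store, together with the slot-index wrap from the module docstring. `ToyStore` and `snapshot_input` are hypothetical names for illustration and do not reproduce `MemoryStore` or `OpLogger`; the slot count of 8 is taken from the docstring's approximate figure.

```python
import numpy as np


class ToyStore:
    """Address-keyed store that rejects reads with a mismatched shape."""

    def __init__(self):
        self._mem = {}

    def write(self, addr, arr):
        self._mem[addr] = np.array(arr, copy=True)

    def read(self, addr, shape):
        stored = self._mem[addr]
        if stored.shape != tuple(shape):
            raise ValueError(
                f"Shape mismatch: stored {stored.shape} "
                f"vs requested {tuple(shape)}"
            )
        return stored.copy()


def snapshot_input(store, addr, shape, data=None):
    """Prefer the snapshot carried by the handle; otherwise read the store."""
    if data is not None:
        return np.array(data, copy=True)
    try:
        return store.read(addr, shape)
    except (KeyError, ValueError):
        return None  # mirrors the "snapshot becomes None" outcome


# Slot index as described in the docstring: tail modulo slot count.
# 42 recvs against 8 slots means every slot is reused several times.
n_slots = 8
slots_hit = [tail % n_slots for tail in range(42)]

store = ToyStore()
original = np.full((16, 1), 7.0, dtype=np.float16)
store.write(0x3000, original)
# A later recv wraps onto the same slot with a different shape.
store.write(0x3000, np.full((16, 64), 99.0, dtype=np.float16))

carried = snapshot_input(store, 0x3000, (16, 1), data=original)
reread = snapshot_input(store, 0x3000, (16, 1))
```

Here `carried` still holds the recv-time values, while `reread` is lost because the store now holds a `(16, 64)` array at that address.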
@@ -0,0 +1,334 @@
 """Generate docs/adr/INDEX.md (and docs/adr-ko/INDEX.md) from the ADR corpus.
 Auto-derives a section-based index following the same classification as
 the /report skill — Design Principles / High-level Architecture /
 Detailed Architecture (by component) / Implementation Decisions
 (by topic). Run before publishing to refresh INDEX.md.
 The classification table below is the single source of truth. When a new
 ADR is added under docs/adr/, append an entry to ``CLASSIFICATION``. The
 script exits 1 if any ADR file is missing from the table or any title
 cannot be parsed, so omissions surface in CI.
 Usage:
    python tools/generate_adr_index.py [--root <repo-root>] [--check]
  --check : exit 1 if the generated INDEX differs from the on-disk file
            (used by CI to detect un-regenerated indexes).
 """
 from __future__ import annotations
 import argparse
 import re
 import sys
 from pathlib import Path
 ADR_FILENAME_RE = re.compile(r"^ADR-(\d{4})-([a-z0-9_-]+)\.md$")
 # Title separator may be ":" (most ADRs) or "—" (em-dash; ADR-0033 uses
 # this). The verifier (tools/verify_adr_lang_pairs.py) only checks the
 # number, so both styles already coexist in the corpus.
 TITLE_RE = re.compile(r"^# ADR-(\d{4})\s*[:—]\s*(.+?)\s*$")
 DESIGN_PRINCIPLES = "Design Principles"
 HIGH_LEVEL = "High-level Architecture"
 DETAILED = "Detailed Architecture"
 IMPL_DECISIONS = "Implementation Decisions"
 # (section, subgroup) per ADR. subgroup is used to sub-divide Detailed
 # (by component, see DETAILED_COMPONENTS) and Implementation (by topic).
 # Add a line here when introducing a new ADR.
 CLASSIFICATION: dict[int, tuple[str, str | None]] = {
    # Design Principles
    13: (DESIGN_PRINCIPLES, None),
    33: (DESIGN_PRINCIPLES, None),
    # High-level Architecture
    3:  (HIGH_LEVEL, "System hierarchy (Tray / SIP / CUBE / PE)"),
    7:  (HIGH_LEVEL, "Runtime API ↔ sim_engine boundaries"),
    16: (HIGH_LEVEL, "IOChiplet NOC and memory data path"),
    17: (HIGH_LEVEL, "Cube NOC and HBM connectivity"),
    # Detailed Architecture (subgroup matches DETAILED_COMPONENTS entries)
    14: (DETAILED, "pe_pipeline"),  # covers pe_cpu/pe_dma/pe_fetch_store/pe_gemm/pe_math/pe_scheduler
    23: (DETAILED, "pe_ipcq"),
    34: (DETAILED, "hbm_ctrl"),
    35: (DETAILED, "m_cpu"),
    36: (DETAILED, "io_cpu"),
    37: (DETAILED, "forwarding"),
    38: (DETAILED, "pcie_ep"),
    39: (DETAILED, "pe_mmu"),
    40: (DETAILED, "pe_tcm"),
    41: (DETAILED, "sram"),
    42: (DETAILED, "tiling"),
    # Implementation Decisions
    1:  (IMPL_DECISIONS, "Address Scheme"),
    2:  (IMPL_DECISIONS, "Routing & Helper API"),
    4:  (IMPL_DECISIONS, "Memory Semantics & Local-HBM Bandwidth"),
    5:  (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
    6:  (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
    8:  (IMPL_DECISIONS, "Tensor Deployment and Allocation"),
    9:  (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
    10: (IMPL_DECISIONS, "CLI Surface and Semantics"),
    11: (IMPL_DECISIONS, "Address Scheme"),
    12: (IMPL_DECISIONS, "Kernel Execution and Host-Device Messaging"),
    15: (IMPL_DECISIONS, "Component Port/Wire Fabric Model"),
    20: (IMPL_DECISIONS, "Two-Pass Data Execution"),
    22: (IMPL_DECISIONS, "2D Grid Program Identity"),
    24: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    25: (IMPL_DECISIONS, "IPCQ Direction Addressing"),
    26: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    27: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    32: (IMPL_DECISIONS, "Intercube All-Reduce"),
    43: (IMPL_DECISIONS, "Evaluation Harnesses"),
    44: (IMPL_DECISIONS, "Evaluation Harnesses"),
    45: (IMPL_DECISIONS, "Bench Module Contract"),
    46: (IMPL_DECISIONS, "Kernel-side tl.* API (TLContext)"),
    47: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    48: (IMPL_DECISIONS, "Memory Allocator Algorithms"),
    49: (IMPL_DECISIONS, "Probe Subcommand"),
    50: (IMPL_DECISIONS, "Parallelism (Launcher, DP, TP, AHBM backend, CCL algorithm)"),
    51: (IMPL_DECISIONS, "Routing & Helper API"),
    52: (IMPL_DECISIONS, "Sim-engine Op Log and Memory Store Schemas"),
    53: (IMPL_DECISIONS, "Topology Compilation, Diagrams & Builder Algorithms"),
    54: (IMPL_DECISIONS, "Evaluation Harnesses"),
 }
 # Canonical component order for the Detailed Architecture section.
 # Each entry: (component_name, list[ADR-numbers that cover it]).
 # Order matches src/kernbench/components/builtin/*.py alphabetical
 # (the same order /report uses).
 DETAILED_COMPONENTS: list[tuple[str, list[int]]] = [
    ("forwarding",      [37]),
    ("hbm_ctrl",        [34]),
    ("io_cpu",          [36]),
    ("m_cpu",           [35]),
    ("pcie_ep",         [38]),
    ("pe_cpu",          [14]),
    ("pe_dma",          [14, 23]),
    ("pe_fetch_store",  [14]),
    ("pe_gemm",         [14]),
    ("pe_ipcq",         [23]),
    ("pe_math",         [14]),
    ("pe_mmu",          [39]),
    ("pe_scheduler",    [14]),
    ("pe_tcm",          [40]),
    ("sram",            [41]),
    ("tiling",          [42]),
 ]
 def _strip_bom(text: str) -> str:
    """Strip leading UTF-8 BOM if present."""
    if text and ord(text[0]) == 0xFEFF:
        return text[1:]
    return text
 def _find_adrs(adr_dir: Path) -> list[tuple[int, str, Path]]:
    """Return [(num, slug, path), ...] for ADR files in adr_dir, sorted by num."""
    out: list[tuple[int, str, Path]] = []
    for p in sorted(adr_dir.iterdir()):
        if not p.is_file():
            continue
        m = ADR_FILENAME_RE.match(p.name)
        if not m:
            continue
        out.append((int(m.group(1)), m.group(2), p))
    out.sort(key=lambda t: t[0])
    return out
 def _extract_title(path: Path) -> str:
    """Parse the title from the first line `# ADR-NNNN: <title>`. Strips BOM."""
    text = _strip_bom(path.read_text(encoding="utf-8"))
    first_line = text.split("\n", 1)[0] if text else ""
    m = TITLE_RE.match(first_line)
    if not m:
        raise ValueError(
            f"{path.name}: cannot parse title from first line: {first_line!r}"
        )
    return m.group(2)
 def _build_index(adr_dir: Path, link_prefix: str) -> str:
    """Build the INDEX.md text for adr_dir.
    link_prefix is the relative href used for ADR links (e.g., ``./``
    so links resolve relative to the INDEX file location).
    """
    adrs = _find_adrs(adr_dir)
    if not adrs:
        raise RuntimeError(f"No ADR files found under {adr_dir}")
    # Validate every ADR is classified.
    missing = sorted(num for num, _slug, _ in adrs if num not in CLASSIFICATION)
    if missing:
        raise RuntimeError(
            "ADR(s) missing from CLASSIFICATION table in "
            "tools/generate_adr_index.py: "
            + ", ".join(f"ADR-{n:04d}" for n in missing)
            + ". Add an entry for each."
        )
    # Map: num → (filename, title)
    num_to_meta: dict[int, tuple[str, str]] = {}
    for num, _slug, path in adrs:
        num_to_meta[num] = (path.name, _extract_title(path))
    # ── Section assembly ────────────────────────────────────────────
    lines: list[str] = []
    lines.append("# ADR Index")
    lines.append("")
    lines.append(
        f"Auto-generated by `tools/generate_adr_index.py`. "
        f"Total ADRs: **{len(adrs)}**."
    )
    lines.append("")
    lines.append(
        "Classification mirrors the `/report` skill's section assignment. "
        "When adding a new ADR, also add an entry to the "
        "`CLASSIFICATION` table in `tools/generate_adr_index.py`."
    )
    lines.append("")
    def fmt_entry(num: int) -> str:
        fname, title = num_to_meta[num]
        return f"- [ADR-{num:04d}]({link_prefix}{fname}) — {title}"
    # Design Principles
    lines.append("## Design Principles")
    lines.append("")
    nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
                  if sec == DESIGN_PRINCIPLES and n in num_to_meta)
    for n in nums:
        lines.append(fmt_entry(n))
    lines.append("")
    # High-level Architecture (sorted by ADR number)
    lines.append("## High-level Architecture")
    lines.append("")
    nums = sorted(n for n, (sec, _) in CLASSIFICATION.items()
                  if sec == HIGH_LEVEL and n in num_to_meta)
    for n in nums:
        sub = CLASSIFICATION[n][1] or ""
        fname, title = num_to_meta[n]
        if sub:
            lines.append(
                f"- [ADR-{n:04d}]({link_prefix}{fname}) — {title}"
                f"  _({sub})_"
            )
        else:
            lines.append(fmt_entry(n))
    lines.append("")
    # Detailed Architecture (canonical component order)
    lines.append("## Detailed Architecture")
    lines.append("")
    lines.append("One subsection per component file under `src/kernbench/components/builtin/`.")
    lines.append("")
    for comp, adr_nums in DETAILED_COMPONENTS:
        lines.append(f"### {comp}")
        lines.append("")
        if adr_nums:
            for n in adr_nums:
                if n not in num_to_meta:
                    raise RuntimeError(
                        f"DETAILED_COMPONENTS references ADR-{n:04d} for "
                        f"'{comp}' but no such ADR file exists."
                    )
                lines.append(fmt_entry(n))
        else:
            lines.append("_(no ADR coverage)_")
        lines.append("")
    # Implementation Decisions: group by subgroup; topics are then ordered by their smallest ADR number.
    lines.append("## Implementation Decisions")
    lines.append("")
    topic_order: list[str] = []
    topic_to_nums: dict[str, list[int]] = {}
    for n, (sec, sub) in CLASSIFICATION.items():
        if sec != IMPL_DECISIONS or n not in num_to_meta:
            continue
        topic = sub or "Uncategorized"
        if topic not in topic_to_nums:
            topic_order.append(topic)
            topic_to_nums[topic] = []
        topic_to_nums[topic].append(n)
    # Stable order: by smallest ADR-number in topic, so older infra appears first.
    topic_order.sort(key=lambda t: min(topic_to_nums[t]))
    for topic in topic_order:
        lines.append(f"### {topic}")
        lines.append("")
        for n in sorted(topic_to_nums[topic]):
            lines.append(fmt_entry(n))
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
 def _check_or_write(path: Path, content: str, check: bool) -> bool:
    """Write content to path, or compare in --check mode.
    Returns True only when --check finds a diff; write mode always returns False.
    """
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    if check:
        if existing != content:
            print(f"[diff] {path} would change.")
            return True
        return False
    path.write_text(content, encoding="utf-8")
    if existing != content:
        print(f"[wrote] {path}")
    else:
        print(f"[unchanged] {path}")
    return False


def main(argv: list[str] | None = None) -> int:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument(
        "--root", type=Path, default=Path.cwd(),
        help="Repository root (default: cwd)",
    )
    p.add_argument(
        "--check", action="store_true",
        help="Exit 1 if generated INDEX would differ from disk",
    )
    args = p.parse_args(argv)
    en_dir = args.root / "docs" / "adr"
    ko_dir = args.root / "docs" / "adr-ko"
    if not en_dir.is_dir():
        print(f"error: {en_dir} does not exist", file=sys.stderr)
        return 1
    any_diff = False
    try:
        en_index = _build_index(en_dir, link_prefix="./")
    except (RuntimeError, ValueError) as e:
        print(f"error (EN): {e}", file=sys.stderr)
        return 1
    any_diff |= _check_or_write(en_dir / "INDEX.md", en_index, args.check)
    if ko_dir.is_dir():
        try:
            ko_index = _build_index(ko_dir, link_prefix="./")
        except (RuntimeError, ValueError) as e:
            print(f"error (KO): {e}", file=sys.stderr)
            return 1
        any_diff |= _check_or_write(ko_dir / "INDEX.md", ko_index, args.check)
    if args.check and any_diff:
        print(
            "INDEX.md is out of date. "
            "Run `python tools/generate_adr_index.py` to refresh.",
            file=sys.stderr,
        )
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
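
The `--check` flow above relies on the return value of `_check_or_write`: it is `True` only when check mode finds a difference, and `main` turns that into exit code 1. The following is a minimal standalone sketch of that contract, not the tool itself. `check_or_write` and `demo` are hypothetical names, the status prints are omitted, and the file contents are made up for illustration.

```python
import tempfile
from pathlib import Path


def check_or_write(path: Path, content: str, check: bool) -> bool:
    """Simplified stand-in: True only when check mode sees a difference."""
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    if check:
        # Check mode never touches the file; it only reports staleness.
        return existing != content
    path.write_text(content, encoding="utf-8")
    return False


def demo() -> list[bool]:
    """Exercise check/write/check against a throwaway INDEX.md."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        index = Path(tmp) / "INDEX.md"
        # A missing file compares as "" and therefore counts as stale.
        results.append(check_or_write(index, "# ADR Index\n", check=True))
        # Write mode creates the file and always returns False.
        results.append(check_or_write(index, "# ADR Index\n", check=False))
        # The same content now passes the check.
        results.append(check_or_write(index, "# ADR Index\n", check=True))
        # Changed content is reported as a diff again.
        results.append(check_or_write(index, "# ADR Index v2\n", check=True))
    return results


print(demo())  # [True, False, False, True]
```

One consequence worth noting: in check mode a missing `INDEX.md` is treated the same as a stale one, so a fresh checkout without a generated index would fail the check.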
Author	SHA1	Message	Date
mukesh	222815d374	attention: add rank_axis kwarg to mesh kernels for multi_user cube ring ADR-0059 single_user_* panels run the ring across PEs in one cube (rank == tl.program_id(axis=0)). multi_user_* panels run the ring across cubes — rank should be cube_id (axis=1), and 7 of every 8 PEs in each cube must stay silent because the cube-level SFR install only gives the cube-coordinate PE 0 an E/W neighbor. Add ``rank_axis: int = 0`` kwarg to both ``attention_mesh_mlo_kernel`` and ``attention_mesh_kv_kernel``: - 0 (default): rank == tl.program_id(axis=0). Existing single_user behavior, all spec tests unchanged. - 1: gate ``if tl.program_id(axis=0) != 0: return`` at kernel start, then ``rank = tl.program_id(axis=1)``. multi_user_* panels pass this to the kernel via ctx.launch positional arg. Also brings in _attention_mesh_kv.py and _attention_mesh_mlo.py as the committed home of the ADR-0059 kernels (previously living uncommitted in the working tree from sub-cycle 4b). Tests: 7-test rank_axis spec file (default-path + rank_axis=1 gating and cube-id semantics, both kernels); 4-panel diag harness now green end-to-end (single_user_prefill/decode + multi_user_prefill/decode); 763-test wider sweep clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-01 19:53:18 -07:00
mukesh	d9e767d048	runtime_api: ctx.launch honors DPPolicy.num_cubes + adds _auto_dim_remap opt-out Two compounding bugs in ctx.launch's dim-translation path surfaced by multi_user_* panels of milestone-gqa-llama70b (sub-cycle 4c step 2): Bug A: _compute_local_shape divided by self._num_cubes (the topology's cube count, 16 in default topology.yaml) instead of the DPPolicy's effective num_cubes (4 for validation-scale multi_user). The tensor allocator at context.py:471-484 already honored dp.num_cubes; the parallel computation inside launch was out of sync. Fix mirrors the allocator's eff_num_cubes precedence pattern. Bug B: dim_map was keyed by value, so any scalar whose value coincidentally equaled a global tensor dim got rewritten to that dim's local value — e.g. d_head=64 colliding with K's global M=64 in multi_user mode. Legacy bench kernels (va_offset etc.) rely on this remap, so the fix is opt-out: ctx.launch(..., _auto_dim_remap=False) preserves scalars exactly as passed. Default remains True. Tests: 3 new dim-translation tests + 4-panel diag harness covers single_user_* (PASS) and multi_user_* (advances to new SFR/axis layer failure, tracked separately). va_offset + full attention spec suite unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-01 19:33:40 -07:00
mukesh	313dee503c	sim_engine: fix IPCQ slot-wrap snapshot race in Phase 2 replay Phase 1 cannot snapshot math-output sources at outbound send time because math executes only in Phase 2 — so token.data stays None and PE_DMA inbound can't write the recv slot. For own-sends this is harmless (Phase 2 replay reads the stable scratch addr after math runs). For forwarded sends in mesh kernels (ADR-0059), src_addr is a recv slot that gets wrapped by later inbounds before this read's Phase 2 turn, yielding a shape mismatch on the fallback MemoryStore.read. Fix: DataExecutor maintains a per-slot, time-ordered, shape-keyed history. Every ipcq_copy write appends (t_write, value) to the slot's history; _resolve_read falls back to the most recent shape-matching entry with t_write <= the consuming op's t_start. Applied uniformly to _execute_memory, _execute_gemm, and _execute_math. Secondary: OpLogger.record_end for math ops now prefers TensorHandle.data carried by the input handle over a MemoryStore re-read, closing the smaller record-end race covered by the new test_op_log_input_snapshot_race.py unit tests. Tests: 4 new race tests + 6 existing op_log + mesh decode diag + mesh kv/mlo spec — all green. Full repo sweep: 760 passed (3 pre-existing failures unrelated: bench-registry list drift + Windows Tkinter env). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-01 19:14:09 -07:00
mukesh	b1d6fafd3a	eval: commit milestone bench output (track generated figures + results) Per request, the milestone bench output is now tracked in git instead of gitignored, so the figures/results are viewable on the remote: - src/kernbench/benches/1H_milestone_output/gemm/ (3 PNGs + gemm_sweep.json) - src/kernbench/benches/1H_milestone_output/ccl/ (3 per-topology PNGs, buffer-kind PNG+CSV, FSIM comparison PNG, topology.png, summary.csv) Drop the .gitignore rule; update ADR-0054 D3 + Negative (EN+KO) to say the output is committed (regenerable by rerunning the bench). Artifacts produced by full bench runs (milestone-1h-gemm non-FAST, milestone-1h-ccl). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 15:37:27 -07:00
mukesh	cc1bbd0ab7	eval: fold GEMM/allreduce harnesses into self-contained milestone benches Move the GEMM + allreduce sweep/render logic out of scripts/ and tests/ into two self-contained eval benches so a user can regenerate every result + figure with one command: kernbench run --bench milestone-1h-gemm (MILESTONE_FAST=1 reuses JSON) kernbench run --bench milestone-1h-ccl - benches/milestone_1h_{gemm,ccl}.py: single home for each domain; the run(torch) entry drives the sweeps and writes figures into benches/1H_milestone_output/{gemm,ccl}/ (gitignored), then submits a sentinel tensor to satisfy the run_bench contract. - tests/gemm + tests/sccl helpers and scripts/gemm_sweep.py become thin re-export/wrapper shims over the benches (single source preserved); the pytest-only param builders + _run_distributed wrapper stay in the shim. - eval-bench pattern: a bench may drive many configs + build its own per-config engines (extends ADR-0045 D5; reverses ADR-0044 D1/D2). ADR-0054 (EN+KO) records the design; ADR-0043/0044/0045 + CLAUDE.md CLI Semantics amended; ADR INDEX regenerated. Verified: milestone benches run clean (ok=True, all artifacts), full suite 67 passed, lang-pairs OK. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 15:19:52 -07:00
ywkang	e33e76f2d1	adr: add INDEX.md (auto-generated by tools/generate_adr_index.py) Adds a section-based table of contents for the 46-ADR corpus, mirroring the /report skill's classification (Design Principles / High-level Architecture / Detailed Architecture by component / Implementation Decisions by topic). Generated for both docs/adr/ (EN titles) and docs/adr-ko/ (KO titles) from one tool. tools/generate_adr_index.py: - Single CLASSIFICATION dict per ADR — add an entry when introducing a new ADR; the script fails loud if any file is missing from the table. - DETAILED_COMPONENTS lists each builtin component and the ADR(s) that cover it (ADR-0014 appears under six PE engines; ADR-0023 under pe_dma + pe_ipcq). - Accepts both ":" and "—" title separators (matching ADR-0033's existing format). - --check mode for CI: exits 1 if INDEX.md is stale. Also includes the docs/report/architecture-2026-1H.md generated by the prior /report write (the public-facing architecture document; 836 lines, 76 source-attribution comments). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 11:15:37 -07:00
ywkang	bd49c93703	adr: add ADR-0050-0053 — close /report's second-pass G4 candidates Documents four cross-cutting surfaces one layer deeper than the prior G4 batch: - 0050 par-ccl-algorithm-module-contract: how to author a new CCL algorithm in src/kernbench/ccl/algorithms/. Pairs with ADR-0045's bench-module contract. Pins the four required public symbols (kernel, kernel_args, TOPO_NAME_TO_KIND constants, kernel alias), the 9 + tl standardized kernel signature, the kernel_args tuple format, sip_topo_kind dispatch, and the ccl.yaml entry workflow. - 0051 lat-routing-helper-api: every public method of AddressResolver (resolve, find_m_cpu, find_pcie_ep, find_io_cpu, find_all_pcie_eps) and PathRouter (find_path, find_path_with_distance, find_mcpu_dma_path, find_memory_path, find_node_path + 2 shims). Pins the four adjacency graphs (_adj_all / _adj / _adj_mcpu_dma / _adj_local) and the edge-kind exclusion sets they use, plus the single-owner naming convention. - 0052 dev-oplog-memory-store-schemas: OpRecord's 7 fields, the per-op_name params matrix (dma_read, dma_write, gemm_*, math, math reduction, composite_gemm, ipcq_copy, unknown), snapshot timing rules (math = all inputs, dma_write = HBM-only — ADR-0027 race avoidance), TileToken stage_type capture, and MemoryStore's (space, addr) two-level dict with reference-store semantics. - 0053 dev-topology-builder-algorithms: the 6-stage compile pipeline, cube_mesh.yaml's source_hash cache and its 5 input fields, the cube NoC auto-layout algorithm (row/col placement, HBM exclusion zone, PE/M_CPU/SRAM attachment via nearest-router, UCIe N/S/E/W distribution), the node naming convention (single-owner with router.py), the edge-kind catalog, the 4 view projections, and a table of spec-field changes vs mesh regeneration. Bilingual pair verifier passes for all four EN/KO pairs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 10:52:42 -07:00
ywkang	9a02955770	adr: add ADR-0046-0049 — close G4 coverage gaps from /report Documents four cross-cutting surfaces that previously had no ADR backing, each surfaced as a G4 candidate by /report: - 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...), the two execution modes (command-list vs greenlet runner), scratch allocator semantics, dispatch-overhead model, and the kernel registry. - 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group (backend="ahbm") install path. world_size priority (algorithm > defaults > topology), the 4-step init sequence (load ccl.yaml, import algorithm module, derive world_size, install SFR + IPCQ), greenlet- local rank registry, all_reduce dispatch via _defer_wait, barrier no-op rationale, and the explicit list of unsupported dist.* APIs. - 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator free-list semantics. Offset-keyed first-fit with coalescing, the no-validation trust model for free(), HBM/TCM channel separation, page-aligned VA allocation, the page_size dual-default (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and one-allocator-per-sub-unit rule. - 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog. H2D / D2H / PE DMA categories with their exact cube-index choices, the 32 KiB reference size, the 5-point utilization sweep, the formula vs actual column meanings, automatic invariant checks (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine isolation, and the human-readable (not machine-parsable) output contract. Bilingual pair verifier passes for all four EN/KO pairs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 10:25:04 -07:00
ywkang	5f8dd688f5	adr: add ADR-0045 (bench module contract — registration, dispatch, authoring) Documents src/kernbench/benches/: how @bench registration + audit work, how the CLI dispatches via run_bench/RuntimeContext, and the contract a new bench module must satisfy. Nine decisions (D1-D9) cover: - @bench name/description rules and duplicate detection - Module-file convention (_-prefixed helpers vs bench modules) - def run(torch) signature; torch = RuntimeContext - Minimum-one-submit rule (else NO_REQUESTS) - Single-device convention + multi-SIP CCL exception (ADR-0024/0027) - resolve() name/index decision tree; indices are not a stable API - Exact RuntimeContext surface exposed to benches - Env-var parameterization (matmul_composite / gemm_sweep.py pattern) Four alternatives rejected with documented reasons (manifest YAML, decorator entry= arg, @multi_device_bench split, stable indices). Verifier (tools/verify_adr_lang_pairs.py) passes for EN/KO pair. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 16:29:45 -07:00
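
Bug B in commit d9e767d048 above (a value-keyed `dim_map` rewriting unrelated scalars) can be illustrated with a small sketch. This is not the `ctx.launch` implementation. `remap_scalars` is a hypothetical helper, and the local size of 16 assumes the global M=64 is split evenly over the 4 cubes mentioned in the message.

```python
def remap_scalars(args, dim_map, auto_dim_remap=True):
    """Rewrite integer scalars through a value-keyed map unless opted out."""
    if not auto_dim_remap:
        # Opt-out path: scalars are passed through exactly as given.
        return list(args)
    # Value-keyed lookup: any int equal to a global dim is rewritten,
    # whether or not it actually denotes that dim.
    return [dim_map.get(a, a) if isinstance(a, int) else a for a in args]


# Assumed mapping: global M=64 -> local M=16 (64 / 4 cubes).
dim_map = {64: 16}
m, d_head = 64, 64  # d_head only coincidentally equals M

remapped = remap_scalars([m, d_head], dim_map)
preserved = remap_scalars([m, d_head], dim_map, auto_dim_remap=False)

print(remapped)   # [16, 16]  d_head was rewritten by accident
print(preserved)  # [64, 64]  opt-out keeps both scalars intact
```

With the opt-out the caller becomes responsible for passing local sizes explicitly, which matches the commit's note that the default stays `True` for legacy kernels that depend on the remap.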