adr: add ADR-0046-0049 — close G4 coverage gaps from /report

Documents four cross-cutting surfaces that previously had no ADR backing, each surfaced as a G4 candidate by /report: - 0046 prog-tl-context-contract: the kernel-side tl.* API. Enumerates all primitives (ref/load/store/dot/composite/math/reduction/IPCQ/...), the two execution modes (command-list vs greenlet runner), scratch allocator semantics, dispatch-overhead model, and the kernel registry. - 0047 par-ahbm-ccl-backend: torch.distributed.init_process_group (backend="ahbm") install path. world_size priority (algorithm > defaults > topology), the 4-step init sequence (load ccl.yaml, import algorithm module, derive world_size, install SFR + IPCQ), greenlet- local rank registry, all_reduce dispatch via _defer_wait, barrier no-op rationale, and the explicit list of unsupported dist.* APIs. - 0048 mem-allocator-algorithms: VirtualAllocator + PEMemAllocator free-list semantics. Offset-keyed first-fit with coalescing, the no-validation trust model for free(), HBM/TCM channel separation, page-aligned VA allocation, the page_size dual-default (VirtualAllocator 2 MiB / _ensure_allocators 4 KiB fallback), and one-allocator-per-sub-unit rule. - 0049 ver-probe-subcommand: kernbench probe traffic-pattern catalog. H2D / D2H / PE DMA categories with their exact cube-index choices, the 32 KiB reference size, the 5-point utilization sweep, the formula vs actual column meanings, automatic invariant checks (monotonicity, D2H >= H2D, best < worst), per-case GraphEngine isolation, and the human-readable (not machine-parsable) output contract. Bilingual pair verifier passes for all four EN/KO pairs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:04 -07:00
parent 5f8dd688f5
commit 9a02955770
8 changed files with 2154 additions and 0 deletions
@@ -0,0 +1,307 @@
 # ADR-0046: TLContext — Kernel-side `tl.*` API Contract
 ## Status
 Accepted (2026-05-22).
 `src/kernbench/triton_emu/` 의 `TLContext` 가 노출하는 `tl.*` primitive
 집합과 그 의미, 그리고 두 실행 모드 (command-list / greenlet runner) 의
 계약을 명시한다. ADR-0014/0020 가 PE 파이프라인과 2-pass 실행 모델을
 정의하나, **bench 의 kernel 함수가 호출하는 `tl.*` 표면 자체**는 ADR-level
 에 정리되어 있지 않았다.
 ## First action (제일 처음에 하는 일)
 `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id, num_cubes,
 scratch_base, scratch_size)` 생성 시 가장 먼저 다음 6개 필드를 초기화한다:
 - `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes` —
  `tl.program_id` / `tl.num_programs` 가 반환할 값.
 - `self._dispatch_cycles` — 모든 `tl.*` API 호출 시작에서 자동으로 발행될
  `PeCpuOverheadCmd(cycles)` 의 cycle 수.
 - `self._runner` — `KernelRunner` 인스턴스 (있으면 greenlet 모드, 없으면
  command-list 모드).
 - `self._commands: list[PeCommand] = []` — command-list 모드에서 누적할
  command 시퀀스.
 - `self._handle_counter = 0`, `self._completion_counter = 0` — 새 TensorHandle /
  CompletionHandle id 생성용.
 - `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0` —
  PE-로컬 scratch 영역 (math/dot/composite 의 output handle 주소 할당용).
 즉, **TLContext 의 첫 일은 "이 kernel 인스턴스가 어디서 (sip/cube/pe) 어떤
 규모 (num_programs/num_cubes) 로 실행되며, 어느 모드 (runner 유무) 로
 명령을 발사할지 메타데이터를 채우는 것"** 이다. 이 시점에 SimPy event 는
 없으며 command 도 발사되지 않는다.
 런타임 첫 동작은 kernel 함수가 `tl.<api>()` 를 처음 호출할 때 발생한다.
 모든 `tl.*` API 의 표준 entry 동작은:
 1. `self._emit_dispatch_overhead()` 호출 — `dispatch_cycles > 0` 인 경우
   `PeCpuOverheadCmd(dispatch_cycles)` 를 즉시 `_emit`.
 2. API 별 처리 (TensorHandle 생성, command 구성).
 3. `self._emit(cmd)` — runner 모드면 greenlet.switch 로 SimPy 측에 cmd 전달,
   아니면 `self._commands` 에 append.
 ## Context
 `tl.*` 표면은 `TLContext` 가 노출하는 메소드들로 구성되며, kernel 함수가
 받는 `tl` 매개변수가 이 객체다. 사용자(bench 작성자) 입장에서 보이는
 contract:
 - 어떤 primitive 가 있는가
 - 각 primitive 가 어떤 데이터 흐름을 발생시키는가 (DMA / compute / IPCQ /
  metadata-only)
 - TensorHandle 의 `space` 와 `addr` 가 어떻게 결정되는가
 - command-list 모드와 greenlet 모드의 차이
 ADR-0014 (PE pipeline) 가 PE_SCHEDULER 가 받는 PeCommand 들을 정의하나,
 `tl.*` 가 이들을 어떻게 emit 하는지는 코드 컨벤션에만 존재한다. 또한
 ADR-0020 (2-pass data execution) 가 greenlet 모드의 존재를 D3 에서
 언급하나, runner / non-runner 두 경로의 시그너처 차이 (return value 처리)
 는 ADR-level 에 명시되어 있지 않다. 이 ADR 이 그 빈자리를 채운다.
 ## Decision
 ### D1. `tl` 매개변수는 `TLContext` 인스턴스다
 bench 의 kernel 함수는 다음 시그너처를 따른다:
 ```python
 def _kernel(arg1, arg2, ..., tl, **kwargs):
    ...
 ```
 `tl` 의 정체는 `kernbench.triton_emu.tl_context.TLContext` 인스턴스이다.
 real Triton 의 `triton.language` 모듈을 흉내내기 위한 이름이며, real
 Triton 모듈이 들어오는 것은 아니다.
 kernel 함수는 일반 Python 함수이며 `yield` / `async` 가 없다. `tl.*`
 호출이 SimPy event 를 발생시키지만, 호출자(kernel) 쪽에서는 동기 호출처럼
 보인다 — greenlet 모드에서 KernelRunner 가 SimPy ↔ kernel 사이를 중계
 하기 때문 (ADR-0020 D3).
 ### D2. 두 실행 모드 — command-list / greenlet runner
 - **command-list 모드 (`runner is None`)**: `tl.*` 호출이 `self._commands`
  리스트에 PeCommand 를 누적. DMA / GEMM / Math 가 실제 SimPy 시간을
  소비하지 않으며, return value 가 metadata-only TensorHandle (data=None) 다.
  이후 PE_SCHEDULER / sim_engine 가 command 시퀀스를 시간상 재생.
 - **greenlet runner 모드 (`runner is not None`)**: `tl.*` 호출이
  `self._emit(cmd)` 를 통해 `runner.switch_to_simpy(cmd)` 로 부모 greenlet
  (SimPy) 으로 컨트롤을 넘김. 부모는 cmd 를 컴포넌트에 분배하여 SimPy 시간을
  소비한 뒤, DMA read 의 경우 실제 numpy 데이터를 반환. kernel 은 그
  결과를 받아 다음 line 으로 진행 (ADR-0020 D3 의 데이터 인지 실행 모델).
 mode 선택은 KernelRunner 인스턴스를 TLContext 에 주입하는지 여부로 결정
 되며, `tl.*` 메소드들은 이 차이를 인지하지 않고 `_emit()` 헬퍼를 통해
 일관되게 동작한다.
 ### D3. Primitive 카테고리
 #### D3.1. Reference (no DMA, metadata only)
 - `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: HBM 데이터를 참조하는
  핸들만 만들고 DMA 는 발행하지 않음. composite scheduler 가 per-tile 로
  스트리밍할 때 사용 (예: GEMM 의 b 피연산자).
 #### D3.2. Data movement (blocking, DMA engine)
 - `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → 결과 핸들.
  `DmaReadCmd` 발행. greenlet 모드에서는 결과 핸들의 `.data` 에 실제
  numpy 배열 첨부; command-list 모드에서는 placeholder. 반환 핸들의
  `space="hbm"`, `pinned=True`.
 - `tl.store(ptr, handle) -> None`: TCM → HBM. `DmaWriteCmd` 발행. greenlet
  모드에서는 `handle.data` 가 있을 때만 `_store.write("hbm", ptr, data)` 를
  먼저 호출 (visibility = issue time, ADR-0020 D3).
 #### D3.3. GEMM / compute (blocking)
 - `tl.dot(a, b) -> TensorHandle`: `a @ b`. 두 피연산자는 TCM 이어야 하며,
  shape (M,K) × (K,N) → (M,N). `GemmCmd` 발행, output handle 은
  `_make_compute_out(shape, dtype)` 로 PE-로컬 scratch 에 할당.
 - `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
  acc_dtype=None, tile_shape=None) -> CompletionHandle`: 비차단(non-blocking)
  tiled pipeline. `CompositeCmd` 발행. `epilogue` 는 dict list, 각 dict 는
  `"op"` 키 + op-specific 필드 + 옵션 `"scope"` (k_tile / output_tile);
  unknown op 나 missing field 는 즉시 ValueError. 반환된 CompletionHandle 은
  `tl.wait(h)` 로 동기화.
 #### D3.4. Math: unary (blocking)
 - `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
  `tl.cos(x)`, `tl.sin(x)` — 모두 `MathCmd(op=<name>, inputs=(x,), out=)`
  발행. `out` 은 동일 shape/dtype 의 scratch 할당.
 #### D3.5. Math: binary (blocking)
 - `tl.maximum(a, b)`, `tl.minimum(a, b)` — `_binary_math`.
 - `tl.fma(a, b, c)` — `a*b + c`. inputs 3개.
 - `tl.clamp(x, min, max)` — `MathCmd(op="clamp", inputs=(x, min, max))`.
 - `tl.where(cond, a, b)` — `MathCmd(op="where", inputs=(cond, a, b))`.
 - `tl.softmax(x, axis=-1)` — 단일 MathCmd(op="softmax") 로 시간 회계는
  한 번에. Phase 2 DataExecutor 가 canonical (x-max → exp → sum → div) 로
  expand 한다.
 #### D3.6. Reduction (blocking)
 - `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)` — 해당 axis 의
  크기를 1 로 줄인 output handle 을 반환. `MathCmd(op=<name>, inputs=(x,),
  out=, axis=axis)` 발행.
 #### D3.7. Index / scalar (PE_CPU, no engine)
 - `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE 인덱스),
  `axis==1` → cube_id (ADR-0022).
 - `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (cube 당
  PE 수), `axis==1` → num_cubes.
 - `tl.arange(start, end, dtype="i32") -> TensorHandle`: TCM 의 인덱스
  range. command 발사 없이 metadata 만.
 - `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape, value,
  dtype="f16") -> TensorHandle`: TCM 에 placeholder. command 발사 없음.
 #### D3.8. Scalar helpers (no command, no engine)
 - `TLContext.cdiv(a, b) -> int` (static): ceiling division
  `-(-a // b)`. real Triton 의 `tl.cdiv` 모방.
 #### D3.9. Metadata-only (no compute, no DMA)
 - `tl.trans(x) -> TensorHandle`: shape 의 마지막 두 dim 을 swap 한 새
  핸들. 같은 addr/data 를 공유, command 발사 없음.
 #### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
 - `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
  dtype="f16", space="tcm") -> None`: blocking send. handle 형태 또는
  raw 주소 형태 둘 다 허용. `IpcqSendCmd` 발행. handle 의 `.data` 스냅샷이
  명령에 실리는 경우, recv 측에서 받은 데이터의 race 회피.
 - `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
  dst_space=None) -> TensorHandle`: blocking recv. `dst_addr/dst_space`
  둘 다 주면 "copy_to_dst" 모드, 아니면 "return_slot" 모드. greenlet
  모드에서 핸들의 `.data` 에 실제 데이터 첨부.
 - `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
  **DIAGNOSTIC ONLY**. recv blocking 동기화는 그대로 적용되나 slot-read
  latency (slot-IO + PE↔bank fabric drain) 는 건너뛴다. pe2pe overview
  플롯에서 `tl.store` 와의 apples-to-apples 비교용. production kernel 은
  사용 금지 — `consume=False` 라는 별도 명령 분기로 격리되어 있어 실수
  flag 가 작동하지 않는다.
 - `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
  recv. `RecvFuture` 를 반환; 이후 `tl.wait(future)` 로 결과 수령.
 #### D3.11. Composite + control
 - `tl.composite(...)`: D3.3 에서 설명.
 - `tl.wait(handle=None)`: `CompletionHandle` (composite) 또는 `RecvFuture`
  (async recv) 또는 `None` (모든 pending composite) 대기.
 - `tl.cycles(n)`: PE_CPU scalar 실행 overhead 를 명시적으로 선언.
  `PeCpuOverheadCmd(cycles=n)` 발행.
 ### D4. TensorHandle 산술 연산자 — thread-local TLContext
 `tl_context.py` 모듈 로드 시점에 `_enable_tensor_ops()` 가 호출되어
 `TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__` 를 patch한다.
 각 연산자는 thread-local `_ctx` (모듈 변수) 에 저장된 active TLContext 의
 `_binary_math` 를 호출한다.
 따라서 kernel 안에서 `c = a + b` 는 `MathCmd(op="add", inputs=(a,b),
 out=)` 발행 + new TensorHandle 반환 패턴과 동일하다.
 active TLContext 관리:
 - `TLContext._set_active(ctx)`: 현재 thread/greenlet 의 active ctx 설정.
 - `TLContext._get_active()`: 조회 (없으면 RuntimeError).
 - `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. 진입 시
  active 설정, kernel 실행, 종료 시 None 으로 복원.
 `KernelRunner` 는 매 cmd 분배 시 `_switch_kernel` 가 직접 `_set_active(tl)`
 를 호출하여, 같은 thread 안의 다른 PE runner 가 active 를 덮어쓴 경우에도
 복원되도록 한다.
 ### D5. Scratch allocator — compute output handles
 `tl.dot`, `tl.exp`, `tl.add` (TensorHandle `__add__`) 등 결과를 만드는 op 는
 `_make_compute_out(shape, dtype)` 를 호출하여 16-byte aligned scratch
 주소를 할당한다. 이 주소는 `space="tcm"` 로 발행되며, 이후 `tl.send` /
 `tl.store` 가 이 handle 을 source 로 사용할 수 있다.
 `_scratch_base == 0` (command-list 모드 등) 이면 할당 주소가 0으로
 반환되어 handle 은 send/store 의 source 로 사용 불가 (이 경우 `tl.load`
 로 받은 핸들만 source 가 될 수 있다).
 cursor 가 `_scratch_size` (default 1 MiB) 를 초과하면 RuntimeError.
 cursor 는 매 kernel invocation 시작 시 0 으로 리셋되어야 하나 (현재 코드는
 KernelRunner 가 새 TLContext 를 매번 생성하여 자연스럽게 리셋됨).
 ### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
 모든 non-metadata `tl.*` 호출의 entry 에서 `_emit_dispatch_overhead()` 가
 호출되며 `dispatch_cycles > 0` 일 때 `PeCpuOverheadCmd(dispatch_cycles)`
 를 발행한다. PE_CPU 가 명령 dispatch 자체에 소비하는 cycle 비용을
 모델링하기 위함이다.
 기본값:
 - `TLContext.__init__` 의 `dispatch_cycles` 매개변수 기본값: 1 cycle.
 - `KernelRunner` 가 만드는 TLContext: 0 cycles (greenlet 모드는 cycle
  회계가 별도, ADR-0020 D3 정신).
 ### D7. Kernel registry (`triton_emu/registry.py`)
 별도의 `_kernels: dict[str, Callable]` 가 kernel 이름 → 함수 매핑을 보유:
 - `register_kernel(name, fn)`: duplicate 등록 시 ValueError.
 - `get_kernel(name)`: 미등록 시 KeyError.
 - `clear_registry()`: 테스트 전용.
 `RuntimeContext.launch(kernel_name, kernel_fn, *args)` 가 매 호출마다
 `_kernels[kernel_name] = kernel_fn` 으로 idempotent 덮어쓴다 (last call
 wins). 이는 ADR-0045 D8 의 launch 동작과 정합된다.
 PE_CPU 는 `KernelRef.name` 으로 registry 에서 kernel 함수를 lookup 한 뒤
 KernelRunner 로 실행한다.
 ## Alternatives Considered
 ### A1. tl.* 를 ADR-0014 / ADR-0020 안으로 통합
 기각. ADR-0014 는 PE pipeline (PeCommand 의 sim_engine 측 소비) 를, ADR-0020
 은 2-pass 실행 (Phase 1 timing / Phase 2 data) 을 다룬다. `tl.*` 는 kernel
 작성자가 만나는 API 표면이라 독립 분리하는 것이 검색성·온보딩 측면에서
 낫다.
 ### A2. command-list 모드 deprecation
 기각 (현재). 단순한 unit test 와 kernel verification 에서 command-list
 모드가 가볍게 동작한다. greenlet 의존성 없이 PeCommand 시퀀스를 검사할 수
 있는 출입구로 유지한다. greenlet 모드만의 의미 (실데이터, Phase 2) 가
 필요하면 D2 의 mode 선택으로 명시적으로 들어간다.
 ### A3. TensorHandle 산술 연산자 제거
 기각. real Triton 의 kernel 코드 가독성을 흉내내기 위함이며 (예: `c = a +
 b`), thread-local active ctx 패턴이 깔끔하게 작동 중. 명시적 `tl.add(a, b)`
 도 D3.5 에 노출되어 있어, 연산자가 헷갈리면 함수형 호출로 대체 가능.
 ### A4. softmax 를 명시적 시퀀스 (max → exp → sum → div) 로 expand
 부분 채택. `tl.softmax` 는 단일 `MathCmd(op="softmax")` 로 timing 회계는
 한 번에 처리한다 (D3.5). 실 데이터 expansion 은 Phase 2 DataExecutor 가
 canonical 시퀀스로 풀어준다. 즉, 시간 모델은 atomic, 데이터 모델은
 expansion — 두 마리 토끼를 의도적으로 분리.
 ## Consequences
 - bench 작성자가 만나는 모든 `tl.*` primitive 가 한 ADR 에 분류·정의됨.
  ADR-0045 D8 의 host-side surface (torch.empty 등) 와 짝을 이루어 "kernel
  안 / 밖" 양쪽 작성 가이드가 완성.
 - command-list / greenlet 두 모드의 차이가 D2 에 명시되어, 새로운 `tl.*`
  primitive 추가 시 `_emit()` 패턴만 따르면 양쪽 자동 호환됨.
 - thread-local active ctx 패턴 (D4) 이 ADR-level 에서 정당화되어, 향후
  multi-PE 동일-thread 실행 시 reset 책임이 어디인지 명확해짐
  (`_switch_kernel` 가 cmd 분배 시 active 복원 — KernelRunner.run 의
  contract).
 - `tl.recv_no_consume` 의 진단 전용 격리(D3.10) 가 ADR 에 굳어져, 실수로
  production kernel 에서 사용되는 것을 막는 layer 가 명확.
 - registry (D7) 가 별도 D 항목으로 분리되어, kernel 이름 충돌 / 동적
  재등록 동작의 사양이 명시.
@@ -0,0 +1,243 @@
 # ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
 ## Status
 Accepted (2026-05-22).
 `runtime_api/distributed.py` 의 `AhbmCCLBackend` + `DistributedContext` —
 즉 `torch.distributed.init_process_group(backend="ahbm")` 진입점이 실제로
 무엇을 설치하고 어떤 의미로 `all_reduce`/`barrier`/`get_rank` 등을
 구현하는지를 명시한다. ADR-0023 D11 이 "torch.distributed compatibility"
 의도를 언급하나, **backend 자체의 동작 모델**은 ADR-level 에 없었다.
 ## First action (제일 처음에 하는 일)
 `RuntimeContext.__post_init__` 가 자동으로 `DistributedContext()` 인스턴스를
 만들어 `self.distributed` 에 attach 한다. 그 시점의 첫 일은:
 1. `self._backend: AhbmCCLBackend | None = None` 으로 초기화 (아직 init
   되지 않은 상태).
 2. `self._rank_by_greenlet: dict = {}` 로 greenlet-local rank 레지스트리
   초기화 (ADR-0024 D2).
 3. 호출자(RuntimeContext) 측에서 `dc._ctx_ref = self` 로 back-reference 를
   심어, 이후 `init_process_group` 가 `ctx.engine` / `ctx.spec` / `ctx.launch`
   에 도달할 수 있게 한다.
 즉, **DistributedContext 의 첫 일은 "RuntimeContext 에 자기 자신을
 back-reference 와 함께 부착하고 backend 슬롯을 비워두는 것"**. 실제 backend
 설치(IPCQ install, world_size 산출, 알고리즘 모듈 로드)는 사용자 코드의
 `torch.distributed.init_process_group(backend="ahbm")` 호출 시점에 비로소
 일어난다.
 해당 시점의 `init_process_group` 의 첫 일은:
 1. `backend != "ahbm"` 이면 즉시 `ValueError("Unsupported backend ...")`.
 2. `getattr(self, "_ctx_ref", None)` 가 None 이면
   `RuntimeError("DistributedContext not bound to a RuntimeContext")`.
 3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)` — 이 생성자 안에서
   ccl.yaml load + 알고리즘 모듈 import + world_size 산출 + SFR 설정 +
   IPCQ install 이 모두 일어난다.
 4. `self._backend._dist_ctx = self` — backend 가 거꾸로
   `_rank_by_greenlet` 에 접근할 수 있게 함.
 ## Context
 PyTorch DDP 의 collective 호출 (`init_process_group`, `all_reduce` 등) 을
 그대로 사용할 수 있게 만들어, bench 코드가 "진짜 DDP training script" 와
 동일한 모습이 되도록 하는 것이 `AhbmCCLBackend` 의 목적이다 (ADR-0024 +
 ADR-0027 의 launcher 모델과 정렬).
 이 backend 가 책임지는 것:
 - `init_process_group` 시점에 **IPCQ neighbor table 을 한 번 설치** (real
  NCCL communicator creation 과 유사).
 - `all_reduce(tensor, op="sum")` 호출 시 **설정된 algorithm 의 kernel 함수
  를 `ctx.launch(...)` 로 발사**.
 - `get_world_size` / `get_rank` 를 greenlet-local rank 레지스트리와
  ccl.yaml/topology 로부터 일관되게 답함.
 ADR-0023 D10 (IPCQ install plan), ADR-0024 (SIP launcher) 가 부분적으로
 이를 다루나, **`AhbmCCLBackend` 자체의 책임 범위와 의사결정 순서**는
 어디에도 명시되어 있지 않다. 본 ADR 이 채운다.
 ## Decision
 ### D1. backend 는 `init_process_group(backend="ahbm")` 시점에만 생성된다
 `DistributedContext` 는 `__init__` 시점에 `_backend = None` 으로 시작한다.
 backend 객체는 사용자가 `dist.init_process_group(backend="ahbm")` 를
 호출하기 전까지 존재하지 않으며, 그 외 API (`is_initialized`,
 `get_world_size`, `all_reduce`, `barrier`) 가 backend 가 None 인 채로
 호출되면 `RuntimeError("Default process group has not been initialized...")`
 를 던진다 (`_ensure_initialized` 헬퍼).
 `backend != "ahbm"` 은 즉시 `ValueError`. 다른 backend 명 (nccl, gloo
 등) 은 인식하지 않는다.
 ### D2. world_size 산출 우선순위 — algorithm > defaults > topology
 `AhbmCCLBackend._resolve_world_size` (ADR-0024 D1) 의 결정 순서:
 1. `ccl.yaml` 의 algorithm entry 에 `world_size` 가 있으면 그 값.
 2. `defaults.world_size` 가 있으면 그 값.
 3. 둘 다 없으면 `spec.system.sips.count` (=topology 의 SIP 개수).
 기본 의미는 **rank = SIP** (ADR-0024). cube/PE-level parallelism 은 각
 rank 안에서 DPPolicy 로 표현되며 world_size 에 영향을 주지 않는다. 명시적
 `ccl.yaml` 의 world_size override 가 있으면 legacy "rank = flat PE 인덱스"
 테스트 경로를 위해 그대로 존중된다.
 `init_process_group(world_size=..., rank=...)` 의 사용자 인자는 **수신하나
 무시**된다 (real PyTorch 의 `RANK` / `WORLD_SIZE` env var 와 같은 의미).
 ### D3. `init_process_group` 가 즉시 하는 4가지 설치 작업
 `AhbmCCLBackend.__init__` 안에서 다음이 순차 실행된다:
 1. **ccl.yaml 로딩**: `kernbench.ccl.install.load_ccl_config()` →
   `resolve_algorithm_config(_cfg_all)` 로 `defaults.algorithm` (또는
   사용자가 지정한 알고리즘) 의 merged config 산출.
 2. **알고리즘 모듈 import**: `importlib.import_module(self._merged["module"])`.
   이 모듈은 `kernel` 함수, `kernel_args(world_size, n_elem, cube_w, cube_h)`
   helper, optional `TOPO_NAME_TO_KIND` 매핑을 노출해야 한다.
 3. **world_size 산출** (D2).
 4. **topology 메타 수집**: `spec` 으로부터 `n_sips`, `sip_topo` (`ring_1d`
   기본), `cube_w`/`cube_h`, `sips.w`/`sips.h`. SIP topology 가 ring_1d 가
   아니면 explicit `w`/`h` 또는 square root 로 (`w*h == n_sips` 보장)
   `_sip_topo_w/h` 산출. 불일치 시 `ValueError`.
 5. **SFR + IPCQ 설치**: `kernbench.ccl.sfr_config.configure_sfr_intercube_multisip
   (engine, spec, self._merged)` 를 호출. 이 함수가 모든 SIP/cube 의 pe0 에
   IPCQ neighbor table 을 푸시 (real NCCL communicator 의 일회성 설정에
   해당).
 이 순서가 변하면 (예: SFR 전에 algorithm 모듈 load 가 실패하면) 부분 초기화
 상태가 발생할 수 있다. 따라서 D3 는 atomic 한 4-단계로 본다 — 실패 시
 backend 는 미설치 상태로 남는다.
 ### D4. greenlet-local rank 등록 (ADR-0024 D2)
 `DistributedContext._rank_by_greenlet: dict[greenlet, int]` 은 spawn 된
 worker greenlet 각각에 rank 를 매핑한다. bench launcher (예:
 `torch.multiprocessing.spawn`) 가 worker 를 띄울 때
 `dc._bind_rank(g, rank)` 를 호출하여 등록한다.
 `get_rank()` 는 `getcurrent()` 의 greenlet 을 lookup. 미등록 greenlet은
 fallback 으로 0 을 반환 — single-driver / 테스트 호환성 유지.
 backend 는 `_dist_ctx._rank_by_greenlet` 를 통해 `all_reduce` 시 현재
 greenlet 의 rank 를 가져온다 (D5).
 ### D5. `all_reduce(tensor, op="sum")` 동작
 검증 단계:
 - `op != "sum"` → `NotImplementedError`. 현재 kernel 들은 add reduction만 구현.
 - `tensor._handle is None` → `RuntimeError("not deployed")`.
 - `tensor._handle.shards` 가 비면 `RuntimeError("no shards")`.
 준비 단계:
 - `n_elem = shards[0].nbytes // tensor.itemsize` — 단일 shard 의 element 수.
 - `kernel_fn = self._algo_module.kernel` — D3 에서 import 된 알고리즘 모듈의
  진입 함수.
 - effective cube dims 결정: 첫 번째 SIP 의 cube 갯수가 1 이면 (1,1) 으로
  scalar 처리, 아니면 토폴로지의 `cube_w`/`cube_h` 사용. TP 가 일부 cube
  만 쓰는 경우를 자연스럽게 흡수.
 - `kernel_args = self._algo_module.kernel_args(world_size, n_elem, cube_w,
  cube_h)` — 알고리즘이 자기 kernel 에 넘길 인자 셋을 결정.
 dispatch:
 - 현재 greenlet 의 rank 를 `_rank_by_greenlet.get(g, 0)` 로 lookup.
 - `extra_args = (sip_rank, sip_topo_kind, sip_topo_w, sip_topo_h)` 를 append.
 - `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor, *kernel_args,
  *extra_args, _defer_wait=True)` — `_defer_wait=True` 로 collective drain
  을 메인 scheduler 에 위임 (ADR-0027 D0.4).
 drain:
 - 부모 greenlet 이 살아있으면 (multi-greenlet 모드) `_pending_collective_handles`
  에 enqueue 한 뒤 부모로 switch. 메인 scheduler 가 모든 rank 의 launch 후
  일괄 drain.
 - 단일-driver 모드면 inline 으로 `for h, _sip_id, meta in pending:
  self.ctx.wait(h, _meta=meta)` 즉시 drain.
 ### D6. `barrier()` 는 no-op 이다 (single-driver 모델)
 kernbench 는 하나의 Python process 안에서 모든 rank 를 greenlet 으로 다룬다.
 process 간 동기화가 필요한 상황이 없으므로 `barrier()` 는 호출 가능하지만
 실제 어떤 동기화도 수행하지 않는다. real PyTorch DDP 와의 API 호환성을
 위해 유지 (호출자가 NotImplementedError 를 받지 않도록).
 장래에 multi-process kernbench (예: SimPy event loop 가 process 별로
 독립) 가 도입되면 D6 를 supersede 하는 새 ADR 이 필요.
 ### D7. `get_rank` / `get_world_size` / `get_backend` 의 의미
 - `get_rank()` (D4): 현재 greenlet 의 bound rank. 미등록은 0.
 - `get_world_size()` (D2): backend 가 D3 에서 산출한 world_size.
 - `get_backend()`: 항상 `"ahbm"` 문자열. backend 객체가 존재하지 않으면
  `_ensure_initialized` 에서 RuntimeError.
 real PyTorch 와의 차이:
 - real PyTorch `get_rank()` 는 process global 값이지만, kernbench 는
  greenlet-local. spawn 된 worker 안에서 호출하면 rank, main thread 에서
  호출하면 0. bench 작성자는 worker 함수 안에서만 의미 있는 rank 를 기대해야
  한다.
 ### D8. 지원하는 API 표면 (final)
 `DistributedContext` 가 노출하는 API:
 - `init_process_group(backend="ahbm", world_size=None, rank=None, **kwargs)`
 - `is_initialized() -> bool`
 - `get_world_size() -> int`
 - `get_rank() -> int`
 - `get_backend() -> str`
 - `all_reduce(tensor, op="sum") -> None`
 - `barrier() -> None`
 - (internal) `_bind_rank(g, rank)`
 이외의 PyTorch distributed API (broadcast, reduce, all_gather, gather,
 scatter, send/recv 등) 는 **아직 구현되어 있지 않다**. kernel 레벨에서는
 `tl.send`/`tl.recv` (ADR-0046 D3.10) 로 직접 표현 가능하나, dist.* surface
 로는 노출되지 않는다. 추가 collective 가 필요해질 시 별도 알고리즘 모듈
 + `DistributedContext` 메소드 한 쌍을 추가하여 D8 를 확장한다.
 ## Alternatives Considered
 ### A1. backend 를 `RuntimeContext.__init__` 에서 즉시 생성
 기각. ccl.yaml 이 없거나 알고리즘 모듈을 import 할 수 없는 경우, bench 가
 distributed 기능을 안 쓰는데도 RuntimeContext 생성 자체가 실패하게 된다.
 "호출 시점에 비로소 설치" (D1) 가 lazy 의미상 옳다.
 ### A2. world_size 를 항상 topology 로부터 자동 산출 (override 금지)
 기각. ADR-0024 D1 의 "explicit override" 경로가 legacy 테스트에서 사용 중.
 한 SIP 안에서 PE-level rank 를 따로 정의해야 하는 진단 시나리오를 위해
 유지.
 ### A3. `op != "sum"` 을 silent fallback 으로 처리
 기각. 사용자가 `op="prod"` / `"max"` / `"avg"` 를 의도했는데 silently sum
 이 실행되면 결과 검증이 매우 어렵다. 명시적 `NotImplementedError` 가 안전.
 ### A4. `barrier` 를 SimPy event 로 구현
 기각 (현재). single-driver 모델에서 cross-process 동기화 의미가 없으므로
 no-op 가 의미적으로 정확. SimPy fake-barrier 는 의미 없이 코드 복잡도만
 높임. multi-process kernbench 도입 시 재평가.
 ## Consequences
 - `torch.distributed.init_process_group(backend="ahbm")` 의 4-단계 설치
  (D3) 가 ADR-level 에서 굳어져, 향후 새 collective 알고리즘이 어디에
  훅을 걸어야 하는지 명확.
 - D2 의 우선순위 (algorithm > defaults > topology) 가 명시되어, ccl.yaml
  변경 시 영향 범위를 빠르게 가늠 가능.
 - D6 의 barrier no-op 결정이 ADR-level 에 굳어져, multi-process kernbench
  도입 시 별도 ADR 로 supersede 해야 함이 분명.
 - D8 의 미지원 API 목록이 명시되어, 사용자가 `dist.broadcast(...)` 를
  호출하려 할 때의 명확한 거절 근거 제공.
@@ -0,0 +1,262 @@
 # ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
 ## Status
 Accepted (2026-05-22).
 `policy/address/allocator.py` 의 `_FreeList` / `PEMemAllocator` 와
 `va_allocator.py` 의 `VirtualAllocator` 가 사용하는 free-list 알고리즘,
 페이지 정렬, coalescing 규칙을 명시한다. ADR-0001 (PhysAddr 레이아웃) 과
 ADR-0011 (PA/VA/LA 모델) 이 주소 스킴을 정의하나, **할당 알고리즘**은 별도
 ADR 이 없었다.
 ## First action (제일 처음에 하는 일)
 ### `_FreeList(capacity)`
 생성 즉시 `self._capacity = capacity`, `self._used = 0`, `self._free =
 [(0, capacity)]` 로 초기화. 첫 일은 **전 영역을 single free block 으로
 세우는 것** — 즉 `(offset=0, size=capacity)` 한 튜플이 free list 의 유일한
 원소다.
 ### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
 생성 즉시 두 개의 `_FreeList` 를 만든다:
 - `self._hbm = _FreeList(cfg.hbm_slice_bytes)` — 이 PE 가 소유한 HBM
  slice 의 바이트 크기 (`hbm_bytes_per_cube // hbm_slices_per_cube`) 만큼.
 - `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)` — `tcm_bytes_per_pe -
  tcm_scheduler_reserved_bytes` 만큼 (scheduler 예약분은 사전 분리).
 따라서 PEMemAllocator 의 첫 일은 **이 PE 의 HBM slice 와 사용자
 TCM 영역을 각각 단일 free block 으로 세우는 것**.
 ### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
 생성 즉시 `self._va_base = va_base`, `self._va_size = va_size`,
 `self._page_size = page_size`, `self._used = 0`, `self._free = [(va_base,
 va_size)]`. 첫 일은 **VA base 부터 size 까지 single block 으로 세우고
 page_size 를 회수**.
 ## Context
 `runtime_api/context.py::_ensure_allocators` 는 다음 단계로 allocator 세트를
 구성한다:
 1. spec 으로부터 `hbm_total_gb_per_cube`, `hbm_slices_per_cube`,
   `tcm_size_mb`, target_device 별 SIP 범위 등을 읽음.
 2. `AddressConfig` 로 모든 파라미터를 frozen 하게 패킹.
 3. target SIP 범위 × cube × PE 의 모든 조합에 대해
   `PEMemAllocator(sip, cube, pe, cfg)` 인스턴스를 1개씩 생성.
 4. `VirtualAllocator(va_base=0x1_0000_0000, va_size=64 GiB,
   page_size=pe_mmu.page_size)` 를 1개 생성.
 allocator 들의 책임:
 - **PEMemAllocator**: PE-로컬 HBM slice / TCM 의 PA-공간 할당 (PhysAddr
  encoding 까지 포함).
 - **VirtualAllocator**: device-wide VA 공간을 페이지 정렬로 할당. 이후
  `RuntimeContext._create_tensor` 가 VA → PA 매핑을 `MmuMapMsg` 로 fabric
  에 push.
 이 알고리즘들은:
 - **first-fit** 으로 단순.
 - 자유 블록 리스트는 **offset 정렬 (sorted by start)** 유지.
 - `free()` 시 **양쪽 인접 블록과 coalesce**.
 이런 결정의 근거가 어디에도 없으므로, 향후 누군가 "왜 best-fit 이 아닌가",
 "왜 buddy allocator 가 아닌가", "왜 partial overlap free 가 silently
 허용되는가" 라는 질문에 답할 기준이 필요. 본 ADR 이 그 기준을 마련한다.
 ## Decision
 ### D1. `_FreeList` — offset-기반 first-fit + coalescing
 `policy/address/allocator.py::_FreeList`:
 - 내부 표현: `list[tuple[int, int]]` = `[(start_offset, size), ...]` —
  start offset 으로 정렬된 자유 블록의 sorted list.
 - `alloc(nbytes)`:
  1. free list 를 앞에서부터 순회 (first-fit).
  2. 처음 만나는 `size >= nbytes` 인 블록에서 앞부분을 잘라 사용.
  3. 정확히 일치하면 블록 통째로 제거; 아니면 `(start+nbytes, size-nbytes)`
     로 축소.
  4. `_used += nbytes`, 잘라낸 `start` 반환.
  5. 맞는 블록이 없으면 `AllocationError("overflow ... largest free block
     ...")`.
 - `free(offset, nbytes)`:
  1. `_used -= nbytes`.
  2. `bisect_left(self._free, (offset,))` 로 삽입 위치 결정.
  3. 직전 블록과 인접 (`prev_start + prev_size == offset`) 하면 흡수.
  4. 직후 블록과 인접 (`offset+nbytes == next_start`) 하면 흡수.
  5. coalesced range 를 정렬 위치에 insert.
 이 알고리즘은 fragmentation 에 약점이 있으나 (best-fit / buddy 대비), 본
 시뮬레이터의 워크로드 특성상 (deploy/free 패턴이 거의 stack-like) 충분
 하다는 것이 디자인 가정이다. 워크로드가 변하면 D1 supersede 후보.
 ### D2. partial overlap free 는 **검사하지 않는다**
 `_FreeList.free(offset, nbytes)` 는 호출자가 정확한 (offset, nbytes) 를
 넘긴다고 신뢰한다. 다음을 검증하지 않는다:
 - 그 range 가 실제로 alloc 된 것인지.
 - 그 range 가 다른 alloc 된 영역과 겹치지 않는지.
 이유: 시뮬레이터 컨텍스트에서 호출자는 항상 `alloc()` 의 반환값을 그대로
 저장했다가 `free()` 에 넘기는 패턴이며, 외부 사용자 입력이 아니다. 안전성
 검사를 추가하면 매 free 마다 O(N) 비용이 들어 시뮬 wall-clock 에 영향.
 이 신뢰 모델이 깨지면 (예: 두 텐서가 같은 PA 를 가리키는 코드 경로 도입)
 즉시 ADR-level 으로 재검토.
 ### D3. `PEMemAllocator` — HBM/TCM 두 채널 분리
 `PEMemAllocator(sip_id, die_id, pe_id, cfg)` 는 두 `_FreeList` 를 보유:
 - `_hbm`: `cfg.hbm_slice_bytes` 크기.
 - `_tcm`: `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
  tcm_scheduler_reserved_bytes`) 크기.
 `alloc_hbm(nbytes) -> PhysAddr`:
 - `_hbm.alloc(nbytes)` 로 offset 획득.
 - `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id, pe_local_hbm_offset=offset,
  slice_size_bytes=cfg.hbm_slice_bytes)` 로 PA 인코딩.
 - 실패 시 `AllocationError("HBM overflow ...")`.
 `free_hbm(pa, nbytes)`:
 - `pa.hbm_offset - pe_id * cfg.hbm_slice_bytes` 로 PE-local offset 복원.
 - `_hbm.free(offset, nbytes)`.
 `alloc_tcm(nbytes) -> PhysAddr`: 유사하게 `PhysAddr.pe_tcm_addr` 로 인코딩.
 `free_tcm(pa, nbytes)`: `pa.sub_offset` 을 그대로 사용 (TCM 은 PE-local
 offset 이 곧 sub_offset).
 scheduler-reserved TCM 영역 (`cfg.tcm_scheduler_reserved_bytes`) 은
 allocator 가 인지하지 않는다 (`_tcm` 의 capacity 에서 사전 차감되어 있음).
 이는 ADR-0014 의 PE_SCHEDULER 내부 buffer 예약과 정합된다.
 ### D4. `VirtualAllocator` — 페이지 정렬 first-fit + coalescing
 `policy/address/va_allocator.py::VirtualAllocator`:
 - 내부 표현: `_FreeList` 와 동일한 sorted `list[tuple[int, int]]`.
  최초: `[(va_base, va_size)]`.
 - `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
 - `alloc(nbytes) -> int`:
  1. `aligned = _align_up(nbytes)`.
  2. first-fit 으로 `size >= aligned` 인 블록 탐색.
  3. 블록 앞부분 `aligned` 만큼 잘라 사용. 정확히 일치하면 제거.
  4. `_used += aligned`. 블록 `start` (= aligned 된 VA) 반환.
  5. 실패 시 `VaAllocationError`.
 - `free(va, nbytes)`: `_align_up(nbytes)` 단위로 free. _FreeList 와 동일한
  coalesce 알고리즘.
 `page_size` 의 실제 값은 두 곳에서 다른 기본을 갖는다:
 - `VirtualAllocator.__init__` 의 매개변수 기본값: `2 MiB`. 직접 호출하는
  테스트가 그대로 받는다.
 - `RuntimeContext._ensure_allocators` 가 인스턴스화할 때:
  `pe_mmu.attrs.get("page_size", 4096)` — `topology.yaml` 의
  `pe_mmu.attrs.page_size` 가 있으면 그 값, 없으면 fallback 4 KiB.
 두 기본이 다른 이유: VirtualAllocator 의 standalone 기본은 ADR-0039 의
 PE_MMU stopgap 기본 (2 MiB) 과 정합되어 직접 테스트가 자연스럽고, context
 fallback 의 4 KiB 는 topology 미설정 시 안전한 minimum page 다. 실제 사용
 경로는 항상 후자이며 (`_ensure_allocators` 가 인스턴스화하므로),
 `topology.yaml` 에서 `page_size` 가 명시되면 그 값이 양쪽 (MMU + VA
 allocator) 으로 일관되게 흐른다.
 만약 이 일치가 깨지면 (예: VirtualAllocator 의 page_size 를 PE_MMU 와
 다르게 인스턴스화) MMU `map()` 가 서브-페이지 region 모드 (ADR-0039 D3) 로
 흐른다.
 VA 기본 범위: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64 GiB`. 이
 값은 `_ensure_allocators` 에 하드코딩되어 있으며 ADR-0011 의 VA 모델에서
 직접적인 의미를 갖지는 않는다 — 단지 host 코드와 충돌하지 않을 만큼 큰
 주소 공간을 device-wide 로 잡아둔 것.
 ### D5. allocator 인스턴스의 lifecycle
 - `RuntimeContext._ensure_allocators` 가 lazy 하게 호출됨 (`_create_tensor`
  의 첫 호출 시점).
 - 한 번 생성된 allocator dict (`self._allocators`) 는 RuntimeContext 의
  lifetime 동안 재사용. 같은 process 안의 두 번째 deploy 는 새 객체를
  만들지 않는다.
 - `RuntimeContext.cleanup()` 이 모든 living tensor 의 `_free_tensor()` 를
  호출 → MMU unmap + `va_allocator.free` + `pemem_allocator.free_hbm` 으로
  free list 가 원상복구. 다음 RuntimeContext 가 다시 만들면 초기 상태부터.
 allocator 상태가 RuntimeContext 간에 공유되지 않는 점이 단일 process 안의
 연속 실행에서 deploy → cleanup → deploy 의 결정성을 보장한다.
 ### D6. Allocator 실패는 raise 한다 (silent OOM 금지)
 `_FreeList.alloc` / `VirtualAllocator.alloc` 모두 충분한 free block 이
 없으면 `AllocationError` / `VaAllocationError` 를 던진다. 메시지에는
 "required size + largest available block" 가 포함되어, fragmentation
 인지 진짜 OOM 인지 진단 가능.
 silent fallback (예: 가장 큰 블록만큼만 alloc) 는 절대 금지 — 부분 할당된
 텐서가 SimPy 단계에 들어가면 라우팅·DMA 가 잘못된 PA 를 인지하여 시뮬
 정확도가 깨진다.
 ### D7. address space 와 allocator 의 1:1 대응
 물리 주소 공간 분리는 PhysAddr 의 sub-unit (ADR-0001 D2.3) 으로 표현되며,
 각 sub-unit 마다 별도 allocator 인스턴스를 둔다:
 - HBM slice → `PEMemAllocator._hbm`.
 - PE TCM → `PEMemAllocator._tcm`.
 - (현재 미사용) M_CPU local memory, CUBE SRAM → 별도 allocator 필요. 현재
  구현은 아직 IPCQ-only slot 으로 처리 (ADR-0023 D9.7) 하며 PA 공간을
  share 하지 않으므로 별도 free-list 가 없음.
 cube-level SRAM allocator 가 필요해지면 `_FreeList(cfg.sram_bytes_per_cube)`
 인스턴스를 cube 단위로 추가한다 (`cfg.sram_bytes_per_cube` 는 이미
 `AddressConfig` 에 정의되어 있어 데이터 모델은 준비됨).
 ## Alternatives Considered
 ### A1. best-fit / buddy allocator
 기각 (현재). 워크로드의 alloc/free 패턴이 stack-like (deploy 순서 = free
 순서) 라 first-fit + coalescing 으로 fragmentation 이 충분히 통제된다.
 LLM kernel sweep 에서 long-running fragmentation 이 관찰되면 buddy 로
 교체하는 ADR 을 별도로 만든다.
 ### A2. partial overlap free 검증 추가
 기각. D2 의 신뢰 모델 + O(N) 검사 비용. 단, 디버그 모드 (`KERNBENCH_DEBUG`
 env var 등) 에서 활성화하는 옵션은 후속 작업으로 가능.
 ### A3. VA 와 PA 의 통합 allocator
 기각. VA 공간 (64 GiB device-wide) 과 PA 공간 (slice 별 ~6 GiB) 는 의미
 차원이 다르다. VA 는 host kernel 의 view, PA 는 device sub-unit 의 view.
 ADR-0011 의 VA 모델 정신 (MMU 가 둘 사이를 매핑) 과 정합하기 위해
 allocator 도 분리.
 ### A4. page_size 의 multi-tier 지원 (large page + small page)
 기각 (현재). 단일 page_size (현재 2 MiB) 가 LLM kernel 의 텐서 단위 (수
 MiB~수 GiB) 에 맞고, ADR-0039 D3 의 서브-페이지 region 으로 작은 매핑이
 필요할 때 흡수된다. multi-tier page 는 MMU 자체 모델을 확장해야 하므로
 별도 ADR 후보.
 ## Consequences
 - allocator 알고리즘이 ADR-level 에서 굳어져 (D1·D3·D4), 새로운 시뮬
  시나리오에서 fragmentation 이슈가 발생할 때 "여기서 first-fit + coalesce
  를 쓰고 있다" 가 명확.
 - D2 의 신뢰 모델이 명시되어, 향후 사용자 입력으로부터 직접 alloc/free 를
  받는 경로가 도입되면 본 ADR supersede 가 필요함을 일찍 인지 가능.
 - D7 의 sub-unit별 allocator 1:1 대응이 명시되어, M_CPU/SRAM 별도 영역이
  필요해질 때 어디에 free-list 를 추가해야 하는지 명확.
 - `VirtualAllocator` 의 page_size 가 PE_MMU 설정과 일치해야 함이 D4 에
  적혀 있어, 향후 topology.yaml 의 page_size 변경 시 ADR-0039 stopgap 동작
  과의 상호작용을 빠르게 가늠 가능.
@@ -0,0 +1,231 @@
 # ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
 ## Status
 Accepted (2026-05-22).
 `probes/probe.py` 의 `run_probe(...)` 가 노출하는 traffic-pattern catalog,
 formula vs actual 비교, 그리고 monotonicity / D2H≥H2D 같은 invariant
 체크의 의미를 명시한다. ADR-0010 (CLI surface) 가 `kernbench probe`
 subcommand 를 enumerate 하나, **probe 가 실제로 측정하는 것**과 **어떤
 invariant 를 PASS/FAIL 로 판정하는가**는 ADR-level 에 없었다.
 ## First action (제일 처음에 하는 일)
 `run_probe(topology_path, case_filter=None)` 의 첫 4가지 작업:
 1. `Path(topology_path).expanduser().resolve()` 로 절대 경로 산출.
 2. `load_topology(path)` → `TopologyGraph` 인스턴스 (그래프 + spec).
 3. `_build_edge_map(graph)` → `{(src, dst): Edge}` 빠른 lookup 테이블.
 4. `AddressResolver(graph)` + `PathRouter(graph)` 인스턴스화.
 그 다음 `nbytes = 32768` (= 32 KiB, summary table 의 기준 데이터 크기) 와
 `show_all = (case_filter is None or case_filter == "all")` 를 설정.
 즉, **probe 의 첫 일은 "토폴로지를 한 번 로드하여 edge map / resolver /
 router 를 준비하고, 32 KiB 라는 표준 측정 크기를 픽스하는 것"**. 그 이후
 H2D → D2H → PE DMA 세 카테고리의 case 들이 각각 별도의 `GraphEngine`
 인스턴스에서 실행된다 (case 간 cross-talk 차단).
 ## Context
 `kernbench probe` 는 다음 의도로 도입된 verification 도구다:
 - **수동 분석 ground truth**: 실 시뮬레이션 (`kernbench run --bench ...`)
  결과의 latency 가 비정상으로 보일 때, 단순 traffic pattern 의 정답을 별도
  로 얻어 비교.
 - **formula vs actual 비교**: 분석 모델 (wire latency + overhead + drain)
  과 시뮬레이션 결과 (`total_ns`) 가 일치하는지 확인. 일치하지 않으면 모델
  단순화 가정 (ADR-0033) 어디가 빠진 것인지 단서.
 - **monotonicity check**: hop 수가 늘면 latency 가 단조 증가해야 한다는
  invariant 의 자동 확인.
 - **utilization sweep**: 데이터 크기 (4 KiB ~ 1 MiB) 별 BW 활용률 표.
 이 도구의 동작 사양이 ADR-level 에 없으면:
 - 다른 형식의 traffic pattern (예: MCpuDma, IPCQ) 을 추가하려는 사람이 기존
  카테고리의 표 포맷 / 측정 단위를 일관되게 따르기 어렵다.
 - monotonicity 가 무엇을 기준으로 검사되는지 (hop 수? cube 거리? wire
  길이?) 모호.
 - 32 KiB 라는 기준 크기와 `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]` sweep
  의 의미가 코드 grep 으로만 확인 가능.
 ## Decision
 ### D1. 세 가지 case category — H2D / D2H / PE DMA
 각 category 는 토폴로지 상 별개의 데이터 경로를 가지며, 별도의 summary
 table + sweep table + route detail block 으로 출력된다.
 - **H2D (Host→Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
  dst_pe=0, pattern="zero")` 가 `pcie_ep → io_cpu → m_cpu → hbm_ctrl` 경로
  를 흐른다. cube 인덱스로 hop 수가 증가:
  - h2d-1hop: cube=0, hops=1
  - h2d-2hop: cube=4, hops=2
  - h2d-3hop: cube=8, hops=3
  - h2d-4hop: cube=12, hops=4
 - **D2H (Device→Host Read)**: `MemoryReadMsg(src_sip=0, src_cube, src_pe=0)`.
  forward command path + reverse data path 의 합 latency. 같은 4 hops
  카테고리.
 - **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe, dst_pa)`.
  5 가지 케이스로 cube/PE 위치 변화:
  - pe-local-hbm: same cube, same PE
  - pe-same-half-hbm: same cube, different PE (PE 1)
  - pe-cross-half-hbm: same cube, far PE (PE 4)
  - pe-cross-cube-hbm-best: adjacent cube (cube 1)
  - pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
 cube 인덱스가 4/8/12 (H2D), 1/4/15 (PE DMA) 같이 의미 있는 이유는
 4x4 cube mesh (sip.cube_mesh.w=4, h=4) 에서의 거리 정의 — 추후 cube_mesh
 크기 변경 시 이 값들이 같이 갱신되어야 한다.
 ### D2. 표준 측정 크기 — `nbytes = 32768` (32 KiB)
 모든 case 의 summary table 은 `nbytes=32768` 로 한 번 실행한 결과를
 보여준다. 32 KiB 가 선택된 이유:
 - DMA overhead 와 BW drain 이 한쪽으로 치우치지 않는 적당한 크기.
 - 다수 sub-unit (TCM, register file) 의 1회 transfer 단위와 비교 가능.
 크기별 utilization 변화는 별도 sweep table 이 보여준다 (D3).
 ### D3. Utilization sweep — `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
 `SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`, `SWEEP_LABELS =
 ["4KB", "16KB", "64KB", "256KB", "1MB"]`. 매 size 마다 다음 공식:
 ```
 drain   = nbytes / bottleneck_bw
 total   = overhead + wire + drain
 eff_bw  = nbytes / total
 util%   = eff_bw / bottleneck_bw × 100
 ```
 `bn_bw is None or <= 0` 이면 그 컬럼은 0.0 % 로 출력. 의미: hop 수가 늘
 수록 작은 transfer 는 overhead-bound, 큰 transfer 는 drain-bound 가 되는
 패턴을 한 표에서 확인.
 ### D4. 측정 항목 — actual / formula / breakdown
 각 case 행에 표시되는 컬럼:
 - `Actual` (total_ns): SimPy 실행 결과의 `trace["total_ns"]`.
 - `Ovhd`: 경로상 모든 node 의 `node.attrs["overhead_ns"]` 합 (formula
  breakdown).
 - `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
 - `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
 - `Ovhd%` / `Drain%`: Ovhd/Drain 이 Actual 에서 차지하는 비율 (formula 의
  Wire 는 통상 매우 작아 표시하지 않음).
 - `Eff.BW`: `nbytes / total_ns` (실 측정 BW).
 - `BN.BW`: bottleneck bandwidth (formula). path 상 모든 edge 의 BW 중 최소.
  edge BW 가 없으면 "-".
 - `Util%`: `Eff.BW / BN.BW × 100`. 100% 면 single-stream BW upper bound 에
  도달.
 formula 의 합 (`wire + ovhd + drain`) 과 actual 의 차이가 크면 모델
 단순화가 잡지 못하는 요소가 있다는 신호 (ADR-0033 의 가정 점검).
 ### D5. Invariant 자동 체크 — PASS/FAIL
 다음 invariant 들이 자동으로 확인되어 `[v] PASS` / `[x] FAIL` 로 출력:
 - **H2D / D2H monotonic increase**: hop 수가 늘면 actual latency 가
  단조 증가해야 함. `all(lats[i] < lats[i+1] for ...)`.
 - **D2H ≥ H2D**: 같은 hop 인덱스에서 D2H ≥ H2D (D2H 는 forward command
  + reverse data 두 leg 이므로). `all(d2h[i].total >= h2d[i].total)`.
 - **PE DMA best < worst**: cross-cube best (adjacent) latency < cross-cube
  worst (diagonal) latency.
 - **PE DMA local vs remote**: local BN BW vs remote BN BW 의 비교 출력
  (PASS/FAIL 이 아닌 정보성).
 체크가 FAIL 이면 사람이 즉시 모델/토폴로지 회귀를 인지할 수 있도록 한
 줄로 분명하게 출력.
 ### D6. Route detail — per-hop timestamp trace
 summary 와 sweep 표 이후 각 case 의 path 와 per-hop 누적 시간 (
 `_hop_timestamps`) 가 별도 섹션에서 출력된다:
 - H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
  (`m_cpu → hbm_ctrl`) + per-hop trace.
 - D2H: forward (cmd, no data) + reverse (data) trace 분리 표시.
 - PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
 각 hop 의 timestamp 는 cumulative `wire_ns + overhead_ns` 누적. terminal
 hop 의 annotation 에 `drain:Xns` 가 붙는다. bottleneck edge 는
 `<BN:XXGB/s>` 로 표시되어 시각적으로 식별 가능.
 ### D7. case_filter 인자의 의미
 - `None` 또는 `"all"`: 모든 case 실행 (default).
 - 다른 문자열: 그 이름과 정확히 일치하는 case 만 실행. 예: `kernbench
  probe --case h2d-2hop`.
 각 카테고리 안에서 `name != case_filter` 면 skip 되며, 그 카테고리의
 monotonicity / D2H≥H2D 비교는 데이터가 1개일 때 자연히 skip 된다.
 CLI parser 의 `--case` 기본값은 `"all"`이라 인자 생략 시 전체 실행.
 ### D8. 매 case 별 fresh GraphEngine
 H2D 4개, D2H 4개, PE DMA 5개의 case 가 각각 **새로운 GraphEngine**
 인스턴스에서 실행된다 (`engine = GraphEngine(graph)`). 이유:
 - case 간 누적 상태 (op_log, completion 추적, allocator 등) 가 cross-talk
  하지 않도록 격리.
 - 한 case 의 traffic 이 다른 case 의 BW 측정에 영향을 주지 않도록 보장.
 이 격리는 probe 의 측정 결과를 **각 case 단독 single-flow** 의 latency 로
 해석할 수 있게 한다. multi-flow contention 측정은 별도 도구 (예:
 `pe2pe_overview` 플롯, ADR-0033 의 multi-flow merging 모델) 책임.
 ### D9. 출력 포맷의 안정성
 probe 의 stdout 출력은 사람이 읽기 위함이며, 정확한 컬럼 폭/구분자/공백 은
 machine-readable contract 가 아니다. 자동화된 도구가 probe 결과를 파싱
 하려면 별도 JSON 출력 모드를 추가해야 한다 (현재 미구현).
 PASS/FAIL 줄의 `[v]` / `[x]` 접두사는 CI grep 용 anchor 로 안정 보장.
 ## Alternatives Considered
 ### A1. Probe 를 별도 bench 로 등록 (`@bench(name="probe")`)
 기각. probe 는 bench 가 아니라 verification 도구로 의도된다 — sweep / 분석
 용 multi-engine 실행과 invariant PASS/FAIL 출력이 본질이며, ADR-0045 의
 "단일 디바이스 + 단일 RuntimeContext" bench 모델과 맞지 않는다.
 ### A2. monotonicity 위반 시 exit code 1
 기각 (현재). 인간 검사 도구 위주로 의도되어 있어 PASS/FAIL 줄을 출력하고
 exit 0 로 종료. CI 가 violation 으로 fail 하길 원하면 별도 wrapper 가
 `grep "\[x\]"` 결과로 판단하면 됨. 후속으로 strict-mode flag (`--strict`)
 도입 가능.
 ### A3. probe 의 case 정의를 외부 YAML 로
 기각 (현재). 8개 case (4 H2D + 4 D2H + 5 PE DMA — 합 13개) 는 코드에
 하드코딩되어 있고 의미가 토폴로지 mesh 구조에 단단히 묶여 있다. 외부
 YAML 로 옮기면 cube 인덱스의 의미 (4, 8, 12 / 1, 4, 15) 를 별도로 문서화
 해야 하므로 응집도 손실. 케이스 추가가 잦아지면 그때 별도 ADR 로 도입.
 ### A4. multi-flow contention 측정 추가
 기각 (probe 범위 밖). D8 에서 명시한 single-flow 격리 모델이 probe 의 핵심
 의도. multi-flow contention 은 ADR-0033 latency model 의 다른 영역으로,
 별도 도구 또는 별도 case category 로 처리.
 ## Consequences
 - probe 의 case catalog (D1) 와 측정 단위 (D2/D3) 가 ADR-level 에서 명시
  되어, 새 traffic 카테고리 추가 시 어떤 표 포맷을 따라야 하는지 분명.
 - formula vs actual 의 컬럼 의미 (D4) 가 굳어져, probe 결과를 보고 "왜
  Drain% 가 5% 인가 / 70% 인가" 같은 질문을 빠르게 ADR-0033 가정 점검으로
  연결 가능.
 - invariant 자동 체크 (D5) 가 ADR 에 굳어져, 향후 latency 모델 변경 시
  monotonicity / D2H≥H2D 회귀를 probe 가 즉시 잡아낸다는 안전망 정착.
 - D8 의 case 간 격리가 명시되어, probe 결과를 single-flow 측정으로 안전
  하게 해석 가능. multi-flow 측정이 필요해지면 별도 도구 트랙이 필요함이
  분명.
 - A2 의 strict-mode flag 가 후속 작업 후보로 기록되어, CI 통합 요구 시
  최소 추가 작업으로 도입 가능.
@@ -0,0 +1,327 @@
 # ADR-0046: TLContext — Kernel-side `tl.*` API Contract
 ## Status
 Accepted (2026-05-22).
 Documents the set of `tl.*` primitives exposed by
 `src/kernbench/triton_emu/`'s `TLContext`, their semantics, and the two
 execution-mode contracts (command-list / greenlet runner). ADR-0014/0020
 defines the PE pipeline and the 2-pass execution model, but **the `tl.*`
 surface that bench kernel functions call** had no ADR-level coverage.
 ## First action
 When `TLContext(pe_id, num_programs, dispatch_cycles, runner, cube_id,
 num_cubes, scratch_base, scratch_size)` is instantiated, the first action
 is to initialize six categories of state:
 - `self._pe_id`, `self._num_programs`, `self._cube_id`, `self._num_cubes` —
  values that `tl.program_id` / `tl.num_programs` will return.
 - `self._dispatch_cycles` — cycle count emitted as `PeCpuOverheadCmd(cycles)`
  at the start of every `tl.*` API call.
 - `self._runner` — `KernelRunner` instance (present → greenlet mode;
  absent → command-list mode).
 - `self._commands: list[PeCommand] = []` — command-list accumulator
  (command-list mode only).
 - `self._handle_counter = 0`, `self._completion_counter = 0` — counters
  for generating TensorHandle / CompletionHandle ids.
 - `self._scratch_base`, `self._scratch_size`, `self._scratch_cursor = 0` —
  PE-local scratch region (used for math/dot/composite output handle
  addresses).
 In short, **TLContext's first act is "record where (sip/cube/pe) and at
 what scale (num_programs/num_cubes) this kernel instance runs, and pick
 its dispatch mode (runner present or not)"**. No SimPy event is created
 and no command is emitted at this moment.
 The runtime first action happens when the kernel function first calls a
 `tl.<api>()`. The standard entry for every `tl.*` API is:
 1. Call `self._emit_dispatch_overhead()` — if `dispatch_cycles > 0`,
   immediately `_emit` a `PeCpuOverheadCmd(dispatch_cycles)`.
 2. Per-API processing (TensorHandle creation, command construction).
 3. `self._emit(cmd)` — in runner mode this `greenlet.switch()`es the cmd
   to SimPy; in command-list mode it appends to `self._commands`.
 ## Context
 The `tl.*` surface consists of `TLContext`'s methods, and the `tl`
 parameter received by a kernel function is one of these objects. The
 contract the user (bench author) sees:
 - Which primitives exist.
 - What data flow each primitive triggers (DMA / compute / IPCQ /
  metadata-only).
 - How a TensorHandle's `space` and `addr` are decided.
 - The difference between command-list and greenlet modes.
 ADR-0014 (PE pipeline) defines the PeCommands consumed by PE_SCHEDULER,
 but how `tl.*` emits them is a code-only convention. ADR-0020 (2-pass
 data execution) mentions greenlet mode in D3 but does not pin down the
 signature difference (return-value handling) between the runner /
 non-runner paths. This ADR fills the gap.
 ## Decision
 ### D1. The `tl` parameter is a `TLContext` instance
 A bench kernel function has the signature:
 ```python
 def _kernel(arg1, arg2, ..., tl, **kwargs):
    ...
 ```
 `tl` is a `kernbench.triton_emu.tl_context.TLContext` instance. The name
 imitates real Triton's `triton.language` module; the actual Triton
 module is **not** passed in.
 The kernel is plain Python — no `yield` or `async`. `tl.*` calls produce
 SimPy events, but to the caller they appear synchronous because in
 greenlet mode the KernelRunner relays between SimPy and the kernel
 (ADR-0020 D3).
 ### D2. Two execution modes — command-list / greenlet runner
 - **Command-list mode (`runner is None`)**: `tl.*` calls append PeCommand
  to `self._commands`. DMA / GEMM / Math consume no SimPy time and return
  metadata-only TensorHandles (`data=None`). PE_SCHEDULER / sim_engine
  later replays the command sequence in time.
 - **Greenlet runner mode (`runner is not None`)**: `tl.*` calls
  `self._emit(cmd)` → `runner.switch_to_simpy(cmd)`, handing control to
  the parent greenlet (SimPy). The parent distributes the cmd to
  components, consumes SimPy time, and (for DMA reads) returns real numpy
  data. The kernel receives the result and continues to the next line
  (the data-aware execution model from ADR-0020 D3).
 The choice of mode is decided by whether a KernelRunner is injected into
 the TLContext. The `tl.*` methods themselves are mode-blind — they go
 through `_emit()` uniformly.
 ### D3. Primitive categories
 #### D3.1. Reference (no DMA, metadata only)
 - `tl.ref(ptr, shape, dtype="f16") -> TensorHandle`: create a handle
  referencing HBM data without issuing DMA. Used when the scheduler
  streams the data per-tile (e.g., the b operand of a composite GEMM).
 #### D3.2. Data movement (blocking, DMA engine)
 - `tl.load(ptr, shape, dtype="f16") -> TensorHandle`: HBM → handle.
  Emits `DmaReadCmd`. In greenlet mode the returned handle's `.data`
  carries real numpy data; in command-list mode it is a placeholder.
  The handle has `space="hbm"`, `pinned=True`.
 - `tl.store(ptr, handle) -> None`: TCM → HBM. Emits `DmaWriteCmd`. In
  greenlet mode, when `handle.data` is present, `_store.write("hbm",
  ptr, data)` runs first (visibility = issue time, ADR-0020 D3).
 #### D3.3. GEMM / compute (blocking)
 - `tl.dot(a, b) -> TensorHandle`: `a @ b`. Both operands must live in
  TCM; shapes `(M,K) × (K,N) → (M,N)`. Emits `GemmCmd`; the output
  handle is allocated from PE-local scratch via
  `_make_compute_out(shape, dtype)`.
 - `tl.composite(op, a, b=None, out_ptr=0, math_op=None, epilogue=None,
  acc_dtype=None, tile_shape=None) -> CompletionHandle`: non-blocking
  tiled pipeline. Emits `CompositeCmd`. `epilogue` is a list of dicts,
  each with `"op"` plus op-specific fields and an optional `"scope"`
  (k_tile / output_tile). Unknown ops or missing fields raise
  ValueError immediately. The returned CompletionHandle synchronizes
  via `tl.wait(h)`.
 #### D3.4. Math: unary (blocking)
 - `tl.exp(x)`, `tl.log(x)`, `tl.sqrt(x)`, `tl.abs(x)`, `tl.sigmoid(x)`,
  `tl.cos(x)`, `tl.sin(x)` — each emits `MathCmd(op=<name>,
  inputs=(x,), out=)`. `out` is scratch-allocated with the same
  shape/dtype as `x`.
 #### D3.5. Math: binary (blocking)
 - `tl.maximum(a, b)`, `tl.minimum(a, b)` — `_binary_math`.
 - `tl.fma(a, b, c)` — `a*b + c`. Three inputs.
 - `tl.clamp(x, min, max)` — `MathCmd(op="clamp", inputs=(x, min, max))`.
 - `tl.where(cond, a, b)` — `MathCmd(op="where", inputs=(cond, a, b))`.
 - `tl.softmax(x, axis=-1)` — a single `MathCmd(op="softmax")` so timing
  accounts at one dispatch. Phase 2 DataExecutor expands it to the
  canonical (x-max → exp → sum → div) sequence.
 #### D3.6. Reduction (blocking)
 - `tl.sum(x, axis)`, `tl.max(x, axis)`, `tl.min(x, axis)` — return an
  output handle with the axis size collapsed to 1. Emit
  `MathCmd(op=<name>, inputs=(x,), out=, axis=axis)`.
 #### D3.7. Index / scalar (PE_CPU, no engine)
 - `tl.program_id(axis=0) -> int`: `axis==0` → pe_id (cube-local PE
  index), `axis==1` → cube_id (ADR-0022).
 - `tl.num_programs(axis=0) -> int`: `axis==0` → num_programs (PEs per
  cube), `axis==1` → num_cubes.
 - `tl.arange(start, end, dtype="i32") -> TensorHandle`: an index range
  in TCM. No command emitted.
 - `tl.zeros(shape, dtype="f16") -> TensorHandle`, `tl.full(shape,
  value, dtype="f16") -> TensorHandle`: TCM placeholder. No command
  emitted.
 #### D3.8. Scalar helpers (no command, no engine)
 - `TLContext.cdiv(a, b) -> int` (static): ceiling division
  `-(-a // b)`. Mirrors real Triton's `tl.cdiv`.
 #### D3.9. Metadata-only (no compute, no DMA)
 - `tl.trans(x) -> TensorHandle`: a new handle with the last two dims
  swapped. Shares `addr` and `data`; no command emitted.
 #### D3.10. IPCQ (CCL) primitives (ADR-0023 D4)
 - `tl.send(dir, src=None, *, src_addr=None, nbytes=None, shape=None,
  dtype="f16", space="tcm") -> None`: blocking send. Accepts either
  handle form or raw-address form. Emits `IpcqSendCmd`. The handle's
  `.data` snapshot rides along on the command — avoiding the race
  where a later inbound IPCQ overwrites the slot before the outbound
  PE_DMA reads it.
 - `tl.recv(dir=None, shape=(), dtype="f16", space="tcm", dst_addr=None,
  dst_space=None) -> TensorHandle`: blocking recv. Providing both
  `dst_addr` and `dst_space` enters "copy_to_dst" mode; otherwise
  "return_slot" mode. In greenlet mode the handle's `.data` carries
  the real data.
 - `tl.recv_no_consume(dir=None, shape=(), dtype="f16") -> TensorHandle`:
  **DIAGNOSTIC ONLY**. Has the same blocking-arrival semantics as
  `tl.recv` but skips the slot-read latency charge (slot-IO + PE↔bank
  fabric drain). Used in the pe2pe overview plot for an apples-to-apples
  comparison against `tl.store`. Production kernels MUST NOT use it —
  the diagnostic flag is isolated in its own command branch
  (`consume=False`) so it cannot be accidentally enabled.
 - `tl.recv_async(dir, shape=(), dtype="f16") -> RecvFuture`: non-blocking
  recv. Returns a `RecvFuture`; resolved later by `tl.wait(future)`.
 #### D3.11. Composite + control
 - `tl.composite(...)`: see D3.3.
 - `tl.wait(handle=None)`: wait on a `CompletionHandle` (composite), a
  `RecvFuture` (async recv), or `None` (all pending composites).
 - `tl.cycles(n)`: declare a scalar PE_CPU overhead. Emits
  `PeCpuOverheadCmd(cycles=n)`.
 ### D4. TensorHandle arithmetic operators — thread-local TLContext
 At module load, `tl_context.py::_enable_tensor_ops()` runs and patches
 `TensorHandle.__add__`, `__sub__`, `__mul__`, `__truediv__`. Each
 operator calls `_binary_math` on the active TLContext stored in a
 module-level thread-local `_ctx`.
 So inside a kernel, `c = a + b` is equivalent to emitting
 `MathCmd(op="add", inputs=(a, b), out=)` and returning a new
 TensorHandle.
 Active-TLContext management:
 - `TLContext._set_active(ctx)`: set the active ctx for the current
  thread/greenlet.
 - `TLContext._get_active()`: read it (RuntimeError if unset).
 - `run_kernel(kernel_fn, tl_ctx, *args, **kwargs)`: helper. Sets active
  on entry, runs the kernel, restores `None` on exit.
 `KernelRunner` re-asserts `_set_active(tl)` inside its `_switch_kernel`
 just before resuming the kernel, so a sibling PE runner that overwrote
 the thread-local context is correctly recovered.
 ### D5. Scratch allocator — compute output handles
 Ops that produce a result — `tl.dot`, `tl.exp`, `tl.add` (via
 TensorHandle `__add__`), etc. — call `_make_compute_out(shape, dtype)`
 to obtain a 16-byte-aligned scratch address. The address is published
 with `space="tcm"`, so the handle can later be the source of a
 `tl.send` / `tl.store`.
 When `_scratch_base == 0` (e.g., command-list mode), the address is 0
 and the handle cannot be a send/store source (in that case, only
 `tl.load`-returned handles are valid sources).
 When the cursor exceeds `_scratch_size` (default 1 MiB), a
 RuntimeError is raised. The cursor must reset between kernel
 invocations (current code naturally satisfies this: KernelRunner
 creates a fresh TLContext each time).
 ### D6. Dispatch overhead — `PeCpuOverheadCmd(dispatch_cycles)`
 Every non-metadata `tl.*` call starts with `_emit_dispatch_overhead()`,
 which — when `dispatch_cycles > 0` — emits
 `PeCpuOverheadCmd(dispatch_cycles)`. This models the cycles PE_CPU
 spends dispatching the command.
 Defaults:
 - `TLContext.__init__`'s `dispatch_cycles` parameter default: `1` cycle.
 - TLContext built by `KernelRunner`: `0` cycles (greenlet mode handles
  cycle accounting differently — aligned with ADR-0020 D3 intent).
 ### D7. Kernel registry (`triton_emu/registry.py`)
 A separate `_kernels: dict[str, Callable]` holds the name → function
 mapping:
 - `register_kernel(name, fn)`: ValueError on duplicate.
 - `get_kernel(name)`: KeyError if missing.
 - `clear_registry()`: test-only.
 `RuntimeContext.launch(kernel_name, kernel_fn, *args)` overwrites
 `_kernels[kernel_name] = kernel_fn` on every call (last-call-wins,
 idempotent) — consistent with ADR-0045 D8's `launch` behavior.
 PE_CPU looks up `KernelRef.name` in the registry and runs the function
 through KernelRunner.
 ## Alternatives Considered
 ### A1. Fold `tl.*` into ADR-0014 / ADR-0020
 Rejected. ADR-0014 covers the PE pipeline (sim_engine-side consumption
 of PeCommands); ADR-0020 covers 2-pass execution (Phase 1 timing /
 Phase 2 data). The `tl.*` surface is what the kernel author touches; a
 dedicated ADR improves findability and onboarding.
 ### A2. Deprecate command-list mode
 Rejected (currently). Simple unit tests and kernel verification benefit
 from the lighter command-list path — it exposes a PeCommand sequence
 inspector without requiring greenlet machinery. When greenlet-mode
 semantics (real data, Phase 2) are needed, D2 explicitly selects them.
 ### A3. Remove TensorHandle arithmetic operators
 Rejected. They mimic real Triton kernel ergonomics (e.g., `c = a + b`),
 and the thread-local active-ctx pattern works cleanly. The explicit
 function-form (`tl.add(a, b)`) is also exposed in D3.5, so the
 operators are syntactic sugar.
 ### A4. Expand softmax into the explicit sequence (max → exp → sum → div)
 Partially adopted. `tl.softmax` is a single `MathCmd(op="softmax")` for
 timing accounting (D3.5), but Phase 2 DataExecutor expands it to the
 canonical sequence for real-data computation. Timing model atomic,
 data model expanded — the two split intentionally.
 ## Consequences
 - Every `tl.*` primitive a bench author meets is classified and defined
  in a single ADR. Paired with ADR-0045 D8's host-side surface
  (`torch.empty` etc.), the inside-kernel and outside-kernel authoring
  guides are now complete.
 - The command-list / greenlet difference is pinned in D2, so any new
  `tl.*` primitive that follows the `_emit()` pattern auto-supports
  both modes.
 - The thread-local active-ctx pattern (D4) is justified at ADR level,
  clarifying who owns the reset responsibility when multiple PE
  runners share a thread (KernelRunner.run's contract restores active
  inside `_switch_kernel`).
 - `tl.recv_no_consume`'s diagnostic isolation (D3.10) is hardened in
  ADR form — accidental production use is blocked by a separate
  command branch.
 - The registry (D7) gets its own D-section, formalizing the
  name-collision and dynamic-re-registration semantics.
@@ -0,0 +1,259 @@
 # ADR-0047: AHBM CCL Backend — `torch.distributed`-compat shim
 ## Status
 Accepted (2026-05-22).
 Pins down what `runtime_api/distributed.py`'s `AhbmCCLBackend` +
 `DistributedContext` actually install — i.e., the entry point
 `torch.distributed.init_process_group(backend="ahbm")` — and how
 `all_reduce`/`barrier`/`get_rank` etc. are implemented. ADR-0023 D11
 mentions the "torch.distributed compatibility" intent, but **the backend
 itself** had no ADR-level coverage.
 ## First action
 `RuntimeContext.__post_init__` automatically constructs a
 `DistributedContext()` and attaches it to `self.distributed`. The first
 action at that moment:
 1. `self._backend: AhbmCCLBackend | None = None` — uninitialized.
 2. `self._rank_by_greenlet: dict = {}` — greenlet-local rank registry
   (ADR-0024 D2).
 3. The caller (RuntimeContext) sets `dc._ctx_ref = self` so subsequent
   `init_process_group` can reach `ctx.engine` / `ctx.spec` / `ctx.launch`.
 In short, **DistributedContext's first act is "attach to RuntimeContext
 with a back-reference and leave the backend slot empty"**. Actual
 backend installation (IPCQ install, world_size derivation, algorithm
 module import) happens only when user code calls
 `torch.distributed.init_process_group(backend="ahbm")`.
 At that moment, `init_process_group`'s first action is:
 1. If `backend != "ahbm"`, raise `ValueError("Unsupported backend ...")`
   immediately.
 2. If `getattr(self, "_ctx_ref", None)` is None,
   `RuntimeError("DistributedContext not bound to a RuntimeContext")`.
 3. `self._backend = AhbmCCLBackend(torch_ctx=ctx)` — inside this
   constructor, ccl.yaml is loaded, the algorithm module is imported,
   world_size is derived, SFR is configured, and IPCQ is installed.
 4. `self._backend._dist_ctx = self` — the backend gets a back-reference
   so it can read `_rank_by_greenlet`.
 ## Context
 The `AhbmCCLBackend` exists so that PyTorch DDP collective calls
 (`init_process_group`, `all_reduce`, etc.) work unchanged and bench code
 reads identically to a real DDP training script (in line with
 ADR-0024 + ADR-0027's launcher model).
 The backend's responsibilities:
 - At `init_process_group` time, install the **IPCQ neighbor table once**
  (analogous to NCCL communicator creation).
 - For each `all_reduce(tensor, op="sum")`, dispatch the configured
  algorithm's kernel function via `ctx.launch(...)`.
 - Answer `get_world_size` / `get_rank` consistently from the
  greenlet-local rank registry plus ccl.yaml/topology.
 ADR-0023 D10 (IPCQ install plan) and ADR-0024 (SIP launcher) touch
 parts of this, but **the backend's own responsibility scope and decision
 order** are not pinned anywhere. This ADR fills that gap.
 ## Decision
 ### D1. The backend is created only at `init_process_group(backend="ahbm")` time
 `DistributedContext` starts with `_backend = None`. The backend object
 does not exist until the user calls
 `dist.init_process_group(backend="ahbm")`. Any other API
 (`is_initialized`, `get_world_size`, `all_reduce`, `barrier`) called
 while `_backend` is None raises
 `RuntimeError("Default process group has not been initialized...")` via
 the `_ensure_initialized` helper.
 `backend != "ahbm"` raises `ValueError` immediately. Other backend names
 (`nccl`, `gloo`, etc.) are not recognized.
 ### D2. world_size resolution priority — algorithm > defaults > topology
 `AhbmCCLBackend._resolve_world_size` (ADR-0024 D1):
 1. If `ccl.yaml`'s algorithm entry has `world_size`, use it.
 2. Else if `defaults.world_size` is set, use it.
 3. Else fall back to `spec.system.sips.count` (the topology's SIP count).
 The default interpretation is **rank = SIP** (ADR-0024). Cube/PE-level
 parallelism is expressed inside each rank via DPPolicy and does not
 affect world_size. An explicit `ccl.yaml` override is preserved for the
 legacy "rank = flat PE index" test path.
 User arguments to `init_process_group(world_size=..., rank=...)` are
 **accepted but ignored** (same as real PyTorch's `RANK` / `WORLD_SIZE`
 env vars).
 ### D3. `init_process_group` performs four installation steps
 Inside `AhbmCCLBackend.__init__`, in order:
 1. **Load ccl.yaml**: `kernbench.ccl.install.load_ccl_config()` →
   `resolve_algorithm_config(_cfg_all)` produces the merged config for
   `defaults.algorithm` (or the user-specified algorithm).
 2. **Import algorithm module**:
   `importlib.import_module(self._merged["module"])`. The module must
   expose a `kernel` function, a `kernel_args(world_size, n_elem,
   cube_w, cube_h)` helper, and optionally a `TOPO_NAME_TO_KIND` map.
 3. **Resolve world_size** (D2).
 4. **Collect topology metadata** from `spec`: `n_sips`, `sip_topo`
   (`ring_1d` default), `cube_w`/`cube_h`, `sips.w`/`sips.h`. When the
   SIP topology is not `ring_1d`, derive `_sip_topo_w/h` from explicit
   `w`/`h` or via square-root (require `w*h == n_sips`). Mismatch raises
   `ValueError`.
 5. **Install SFR + IPCQ**:
   `kernbench.ccl.sfr_config.configure_sfr_intercube_multisip(engine,
   spec, self._merged)`. This pushes IPCQ neighbor tables to every
   SIP/cube's pe0 (one-time setup analogous to NCCL communicator
   creation).
 If the order changes (e.g., SFR runs before the algorithm module
 loads), partial initialization can result. So D3 is treated as an
 atomic 4-step block — on failure the backend remains uninstalled.
 ### D4. Greenlet-local rank binding (ADR-0024 D2)
 `DistributedContext._rank_by_greenlet: dict[greenlet, int]` maps spawned
 worker greenlets to their ranks. When the bench launcher (e.g.,
 `torch.multiprocessing.spawn`) spawns a worker, it registers via
 `dc._bind_rank(g, rank)`.
 `get_rank()` looks up `getcurrent()`'s greenlet. Unregistered greenlets
 fall back to 0 — preserves single-driver / test compatibility.
 The backend reads the current greenlet's rank from
 `_dist_ctx._rank_by_greenlet` during `all_reduce` (D5).
 ### D5. `all_reduce(tensor, op="sum")` behavior
 Validation:
 - `op != "sum"` → `NotImplementedError`. Current kernels only
  implement add reduction.
 - `tensor._handle is None` → `RuntimeError("not deployed")`.
 - `tensor._handle.shards` empty → `RuntimeError("no shards")`.
 Preparation:
 - `n_elem = shards[0].nbytes // tensor.itemsize` — element count of a
  single shard.
 - `kernel_fn = self._algo_module.kernel` — the algorithm module's entry
  function (imported in D3).
 - Decide effective cube dims: if the first SIP has just 1 cube, use
  `(1, 1)`; otherwise use the topology's `cube_w`/`cube_h`. This
  naturally absorbs TP runs that use only a subset of cubes.
 - `kernel_args = self._algo_module.kernel_args(world_size, n_elem,
  cube_w, cube_h)` — the algorithm decides which arguments to pass to
  its kernel.
 Dispatch:
 - Resolve the current greenlet's rank via
  `_rank_by_greenlet.get(g, 0)`.
 - Append `extra_args = (sip_rank, sip_topo_kind, sip_topo_w,
  sip_topo_h)`.
 - `pending = self.ctx.launch(algorithm_name, kernel_fn, tensor,
  *kernel_args, *extra_args, _defer_wait=True)` — `_defer_wait=True`
  delegates collective drain to the main scheduler (ADR-0027 D0.4).
 Drain:
 - If the parent greenlet is alive (multi-greenlet mode), enqueue
  `_pending_collective_handles` and switch to parent. The main
  scheduler drains after all ranks have launched.
 - If single-driver mode, drain inline:
  `for h, _sip_id, meta in pending: self.ctx.wait(h, _meta=meta)`.
 ### D6. `barrier()` is a no-op (single-driver model)
 kernbench runs all ranks as greenlets inside a single Python process,
 so no cross-process synchronization is needed. `barrier()` is callable
 but does no synchronization. Kept for real-PyTorch API compatibility so
 callers don't get `NotImplementedError`.
 If multi-process kernbench (SimPy event loop per process) is introduced
 in the future, D6 needs a superseding ADR.
 ### D7. Semantics of `get_rank` / `get_world_size` / `get_backend`
 - `get_rank()` (D4): the current greenlet's bound rank; unregistered → 0.
 - `get_world_size()` (D2): the world_size resolved by the backend in D3.
 - `get_backend()`: always the literal string `"ahbm"`. Calling before
  backend exists triggers `_ensure_initialized`'s RuntimeError.
 Differences vs. real PyTorch:
 - Real PyTorch `get_rank()` is a process-global value; here it is
  greenlet-local. Inside a spawned worker → the worker's rank; in the
  main thread → 0. Bench authors should expect meaningful ranks only
  inside worker functions.
 ### D8. Supported API surface (final)
 `DistributedContext` exposes:
 - `init_process_group(backend="ahbm", world_size=None, rank=None,
  **kwargs)`
 - `is_initialized() -> bool`
 - `get_world_size() -> int`
 - `get_rank() -> int`
 - `get_backend() -> str`
 - `all_reduce(tensor, op="sum") -> None`
 - `barrier() -> None`
 - (internal) `_bind_rank(g, rank)`
 Other PyTorch distributed APIs (`broadcast`, `reduce`, `all_gather`,
 `gather`, `scatter`, point-to-point `send/recv`, etc.) are **not
 implemented**. Kernel-level expression is available via
 `tl.send`/`tl.recv` (ADR-0046 D3.10), but the `dist.*` surface does not
 expose them. If additional collectives are needed, add a paired
 (algorithm module, `DistributedContext` method) and extend D8.
 ## Alternatives Considered
 ### A1. Create the backend in `RuntimeContext.__init__`
 Rejected. If `ccl.yaml` is missing or the algorithm module can't be
 imported, RuntimeContext construction would fail even when the bench
 does not use distributed features. Lazy creation at call time (D1) is
 the right semantics.
 ### A2. Always derive world_size from topology (no override)
 Rejected. ADR-0024 D1's "explicit override" path is used by legacy
 tests. Diagnostic scenarios that define PE-level ranks within a single
 SIP also need this escape hatch.
 ### A3. Silent fallback for unsupported `op`
 Rejected. If the user intends `op="prod"` / `"max"` / `"avg"` and silent
 `sum` runs instead, result validation gets very hard. Explicit
 `NotImplementedError` is safer.
 ### A4. Implement `barrier` as a SimPy event
 Rejected (currently). With single-driver semantics there is no
 cross-process synchronization to express, so a no-op is meaningfully
 correct. A fake-barrier SimPy event would add code complexity for no
 semantic gain. Revisit when multi-process kernbench arrives.
 ## Consequences
 - The 4-step installation (D3) for
  `torch.distributed.init_process_group(backend="ahbm")` is locked in,
  making clear where future collective algorithms must hook.
 - The priority order in D2 (algorithm > defaults > topology) makes the
  blast radius of ccl.yaml changes quickly knowable.
 - The no-op `barrier` (D6) is recorded so multi-process kernbench, if
  introduced, must explicitly supersede this ADR.
 - D8's list of unsupported APIs explicitly grounds the rejection
  message when users call, e.g., `dist.broadcast(...)`.
@@ -0,0 +1,278 @@
 # ADR-0048: Memory Allocator Algorithms — VirtualAllocator + PEMemAllocator
 ## Status
 Accepted (2026-05-22).
 Pins down the free-list algorithm, page alignment, and coalescing rules
 used by `policy/address/allocator.py`'s `_FreeList` / `PEMemAllocator`
 and `va_allocator.py`'s `VirtualAllocator`. ADR-0001 (PhysAddr layout)
 and ADR-0011 (PA/VA/LA models) define the address schemes; the
 **allocation algorithms** had no ADR-level coverage.
 ## First action
 ### `_FreeList(capacity)`
 On construction: `self._capacity = capacity`, `self._used = 0`,
 `self._free = [(0, capacity)]`. The first act is **establishing the
 entire region as one free block** — the tuple `(offset=0,
 size=capacity)` is the sole entry in the free list.
 ### `PEMemAllocator(sip_id, die_id, pe_id, cfg)`
 On construction, builds two `_FreeList`s:
 - `self._hbm = _FreeList(cfg.hbm_slice_bytes)` — the size of this PE's
  HBM slice (`hbm_bytes_per_cube // hbm_slices_per_cube`).
 - `self._tcm = _FreeList(cfg.tcm_allocatable_bytes)` — equals
  `tcm_bytes_per_pe - tcm_scheduler_reserved_bytes` (the scheduler
  reservation is pre-deducted).
 So PEMemAllocator's first act is **constructing single-free-block
 HBM-slice and TCM regions for this PE**.
 ### `VirtualAllocator(va_base, va_size, page_size=2*1024*1024)`
 On construction: `self._va_base = va_base`, `self._va_size = va_size`,
 `self._page_size = page_size`, `self._used = 0`, `self._free =
 [(va_base, va_size)]`. The first act is **establishing one block from
 va_base to va_size and stashing page_size**.
 ## Context
 `runtime_api/context.py::_ensure_allocators` builds the allocator set
 in these stages:
 1. Read `hbm_total_gb_per_cube`, `hbm_slices_per_cube`, `tcm_size_mb`,
   per-target_device SIP range, etc. from `spec`.
 2. Pack everything into a frozen `AddressConfig`.
 3. For every combination in the target SIP range × cubes × PEs,
   construct one `PEMemAllocator(sip, cube, pe, cfg)` instance.
 4. Construct one `VirtualAllocator(va_base=0x1_0000_0000, va_size=64
   GiB, page_size=pe_mmu.page_size)`.
 Allocator responsibilities:
 - **PEMemAllocator**: PA-space allocation in the PE-local HBM slice /
  TCM (including PhysAddr encoding).
 - **VirtualAllocator**: device-wide VA allocation, page-aligned.
  `RuntimeContext._create_tensor` then pushes VA → PA mappings to
  components via `MmuMapMsg`.
 These algorithms are:
 - **First-fit**, kept simple.
 - The free-block list is **sorted by start offset**.
 - On `free()`, **adjacent blocks coalesce**.
 The rationale was not documented anywhere, so when someone asks "why
 not best-fit?", "why not a buddy allocator?", "why does partial-overlap
 free pass silently?", there was no anchor to answer from. This ADR
 provides it.
 ## Decision
 ### D1. `_FreeList` — offset-keyed first-fit + coalescing
 `policy/address/allocator.py::_FreeList`:
 - Internal representation: `list[tuple[int, int]] = [(start_offset,
  size), ...]` — sorted by start offset.
 - `alloc(nbytes)`:
  1. Iterate the free list from the front (first-fit).
  2. Take from the first block with `size >= nbytes`.
  3. Exact match → drop the block; otherwise shrink it to `(start +
     nbytes, size - nbytes)`.
  4. `_used += nbytes`; return the taken `start`.
  5. If no block fits, `AllocationError("overflow ... largest free
     block ...")`.
 - `free(offset, nbytes)`:
  1. `_used -= nbytes`.
  2. `bisect_left(self._free, (offset,))` finds the insertion index.
  3. If adjacent to the previous block (`prev_start + prev_size ==
     offset`), merge.
  4. If adjacent to the next block (`offset + nbytes == next_start`),
     merge.
  5. Insert the coalesced range at the right sorted position.
 This algorithm is weaker than best-fit / buddy on fragmentation, but
 the simulator's workload (mostly stack-like deploy/free) tolerates it.
 If the workload shape changes, D1 is a supersession candidate.
 ### D2. Partial-overlap free is **not** validated
 `_FreeList.free(offset, nbytes)` trusts the caller to pass the exact
 `(offset, nbytes)`. It does **not** verify:
 - That the range was actually allocated.
 - That the range does not overlap another allocated region.
 Reason: in a simulator context, callers always store the return value
 of `alloc()` and pass it back to `free()` — there is no external user
 input. Adding a safety check would cost O(N) per free and impact
 simulation wall-clock.
 If this trust model breaks (e.g., a code path lets two tensors point
 at the same PA), this ADR must be revisited.
 ### D3. `PEMemAllocator` — two channels for HBM/TCM
 `PEMemAllocator(sip_id, die_id, pe_id, cfg)` holds two `_FreeList`s:
 - `_hbm`: size `cfg.hbm_slice_bytes`.
 - `_tcm`: size `cfg.tcm_allocatable_bytes` (= `tcm_bytes_per_pe -
  tcm_scheduler_reserved_bytes`).
 `alloc_hbm(nbytes) -> PhysAddr`:
 - `_hbm.alloc(nbytes)` → offset.
 - `PhysAddr.pe_hbm_addr(sip_id, die_id, pe_id,
  pe_local_hbm_offset=offset, slice_size_bytes=cfg.hbm_slice_bytes)`.
 - Failure raises `AllocationError("HBM overflow ...")`.
 `free_hbm(pa, nbytes)`:
 - Recover PE-local offset via `pa.hbm_offset - pe_id *
  cfg.hbm_slice_bytes`.
 - `_hbm.free(offset, nbytes)`.
 `alloc_tcm(nbytes) -> PhysAddr`: similar; uses `PhysAddr.pe_tcm_addr`.
 `free_tcm(pa, nbytes)`: uses `pa.sub_offset` directly (TCM's PE-local
 offset equals its sub_offset).
 The allocator does not see the scheduler-reserved TCM region
 (`cfg.tcm_scheduler_reserved_bytes`) — it is pre-subtracted from the
 `_tcm` capacity. This is consistent with ADR-0014's PE_SCHEDULER
 internal-buffer reservation.
 ### D4. `VirtualAllocator` — page-aligned first-fit + coalescing
 `policy/address/va_allocator.py::VirtualAllocator`:
 - Internal representation: same sorted `list[tuple[int, int]]` as
  `_FreeList`. Initially `[(va_base, va_size)]`.
 - `_align_up(nbytes) = ceil(nbytes / page_size) * page_size`.
 - `alloc(nbytes) -> int`:
  1. `aligned = _align_up(nbytes)`.
  2. First-fit a block with `size >= aligned`.
  3. Take `aligned` from the block's front; remove if exact.
  4. `_used += aligned`. Return the block's `start` (which is page-
     aligned).
  5. Failure → `VaAllocationError`.
 - `free(va, nbytes)`: free `_align_up(nbytes)` worth. Coalesces with
  the same algorithm as `_FreeList`.
 `page_size` has different defaults in two places:
 - `VirtualAllocator.__init__`'s parameter default: `2 MiB`. Direct-call
  tests receive this.
 - `RuntimeContext._ensure_allocators` when constructing the instance:
  `pe_mmu.attrs.get("page_size", 4096)` — uses
  `topology.yaml`'s `pe_mmu.attrs.page_size` if set, else falls back
  to `4 KiB`.
 The two defaults differ on purpose: `VirtualAllocator`'s standalone
 default (`2 MiB`) aligns with ADR-0039's PE_MMU stopgap default for
 direct-test ergonomics; the context fallback (`4 KiB`) is the safe
 minimum when `topology.yaml` doesn't specify a page size. The
 production path is always the latter (via `_ensure_allocators`), and
 when `topology.yaml` sets `page_size`, that value flows consistently
 into both the MMU and the VA allocator.
 If consistency breaks (e.g., VirtualAllocator instantiated with a
 page_size different from PE_MMU's), MMU `map()` falls into the
 sub-page region mode (ADR-0039 D3).
 VA range defaults: `va_base = 0x1_0000_0000` (= 4 GiB), `va_size = 64
 GiB`. These are hardcoded in `_ensure_allocators` and have no
 semantic meaning in ADR-0011's VA model — they simply reserve enough
 device-wide space without colliding with host code.
 ### D5. Lifecycle of allocator instances
 - `RuntimeContext._ensure_allocators` is lazy — called on the first
  `_create_tensor`.
 - The allocator dict (`self._allocators`) lives for the
  RuntimeContext's lifetime. A second deploy in the same process
  does not construct new objects.
 - `RuntimeContext.cleanup()` walks living tensors and calls
  `_free_tensor()`, which issues MMU unmaps + `va_allocator.free` +
  `pemem_allocator.free_hbm` — restoring the free lists. A subsequent
  RuntimeContext starts fresh.
 This per-RuntimeContext isolation guarantees deterministic deploy →
 cleanup → deploy sequences within a single process.
 ### D6. Allocator failure raises (no silent OOM)
 Both `_FreeList.alloc` and `VirtualAllocator.alloc` raise
 `AllocationError` / `VaAllocationError` when no block fits. The message
 includes "required size + largest available block" to distinguish
 fragmentation from true OOM.
 A silent fallback (e.g., allocating only as much as the largest free
 block) is strictly forbidden — a partially-allocated tensor reaching
 SimPy would cause routing / DMA to see incorrect PAs and silently
 corrupt simulation results.
 ### D7. One allocator per address space
 Physical address spaces are separated by PhysAddr sub-units (ADR-0001
 D2.3); each sub-unit gets its own allocator instance:
 - HBM slice → `PEMemAllocator._hbm`.
 - PE TCM → `PEMemAllocator._tcm`.
 - (Currently unused) M_CPU local memory, CUBE SRAM → would need their
  own allocators. Today these are handled as IPCQ-only slots (ADR-0023
  D9.7) and do not share PA space, so no free-list exists for them.
 When a cube-level SRAM allocator is needed,
 `_FreeList(cfg.sram_bytes_per_cube)` is added per-cube
 (`cfg.sram_bytes_per_cube` is already defined in `AddressConfig` —
 the data model is ready).
 ## Alternatives Considered
 ### A1. Best-fit / buddy allocator
 Rejected (currently). The workload's alloc/free pattern is stack-like
 (deploy order ≈ free order), so first-fit + coalescing controls
 fragmentation well enough. If long-running fragmentation appears in LLM
 kernel sweeps, a buddy-allocator ADR will replace D1.
 ### A2. Add partial-overlap free validation
 Rejected. D2's trust model plus the O(N) per-free cost makes this
 unattractive. A debug mode (e.g., `KERNBENCH_DEBUG` env var) that
 enables the check could be added later.
 ### A3. A unified allocator for VA and PA
 Rejected. VA space (64 GiB device-wide) and PA space (per-slice ~6
 GiB) have different semantic dimensions — VA is the kernel's view, PA
 is the device sub-unit's view. ADR-0011's VA model (MMU maps between
 the two) calls for separated allocators.
 ### A4. Multi-tier page sizes (large pages + small pages)
 Rejected (currently). A single page size (2 MiB) matches LLM kernel
 tensor sizes (a few MiB to GiB); smaller mappings are absorbed by
 ADR-0039 D3's sub-page region mode. Multi-tier paging would require
 extending the MMU model itself — a separate ADR candidate.
 ## Consequences
 - The allocator algorithm is pinned at ADR level (D1, D3, D4), so any
  future simulation scenario hitting fragmentation has a clear "we're
  using first-fit + coalescing" anchor to inspect.
 - D2's trust model is explicit, so any future code path that exposes
  alloc/free to direct user input will trigger this ADR's supersession
  early.
 - D7's one-allocator-per-sub-unit mapping is on record, so when M_CPU
  or SRAM need their own free-list, the addition point is obvious.
 - D4 captures the page_size dual-default and its production path
  (`_ensure_allocators` always wins), letting future `topology.yaml`
  `page_size` changes be assessed against ADR-0039's stopgap
  interaction quickly.
@@ -0,0 +1,247 @@
 # ADR-0049: `kernbench probe` Subcommand — Traffic-Pattern Verification Harness
 ## Status
 Accepted (2026-05-22).
 Pins down the traffic-pattern catalog, formula-vs-actual comparison, and
 invariant checks (monotonicity, D2H ≥ H2D, etc.) exposed by
 `probes/probe.py::run_probe(...)`. ADR-0010 (CLI surface) enumerates the
 `kernbench probe` subcommand, but **what probe actually measures** and
 **which invariants it judges PASS/FAIL** had no ADR-level coverage.
 ## First action
 `run_probe(topology_path, case_filter=None)` performs four startup steps:
 1. `Path(topology_path).expanduser().resolve()` → absolute path.
 2. `load_topology(path)` → `TopologyGraph` (graph + spec).
 3. `_build_edge_map(graph)` → a `{(src, dst): Edge}` lookup table.
 4. Instantiate `AddressResolver(graph)` + `PathRouter(graph)`.
 Then it sets `nbytes = 32768` (= 32 KiB, the summary-table reference
 size) and `show_all = (case_filter is None or case_filter == "all")`.
 In short, **probe's first act is "load the topology once and prepare
 edge map / resolver / router, plus pin 32 KiB as the standard measurement
 size"**. After that, the H2D → D2H → PE DMA categories execute in
 separate `GraphEngine` instances (no cross-talk between cases).
 ## Context
 `kernbench probe` was introduced as a verification tool for these
 purposes:
 - **Manual ground truth**: when a real-simulation result (`kernbench run
  --bench ...`) shows abnormal latency, derive the answer for a simple
  traffic pattern in isolation and compare.
 - **Formula vs actual**: check whether the analytical model
  (wire latency + overhead + drain) matches the simulator's
  `total_ns`. A mismatch points to which simplifying assumption in
  ADR-0033 is missing.
 - **Monotonicity check**: latency should grow monotonically with hop
  count.
 - **Utilization sweep**: a BW-utilization table across data sizes
  (4 KiB ~ 1 MiB).
 Without an ADR for this tool:
 - Adding a new traffic-pattern category (e.g., MCpuDma, IPCQ) is hard
  because the table format / measurement units of existing categories
  aren't documented at the ADR level.
 - The basis for the monotonicity check (hop count? cube distance? wire
  length?) is ambiguous.
 - The reference size 32 KiB and the sweep `[4 KiB, 16 KiB, 64 KiB, 256
  KiB, 1 MiB]` are only discoverable by reading source.
 ## Decision
 ### D1. Three case categories — H2D / D2H / PE DMA
 Each category has a distinct data path in the topology and gets its own
 summary table + sweep table + route-detail block.
 - **H2D (Host → Device Write)**: `MemoryWriteMsg(dst_sip=0, dst_cube,
  dst_pe=0, pattern="zero")` flows along `pcie_ep → io_cpu → m_cpu →
  hbm_ctrl`. The cube index varies the hop count:
  - h2d-1hop: cube=0, hops=1
  - h2d-2hop: cube=4, hops=2
  - h2d-3hop: cube=8, hops=3
  - h2d-4hop: cube=12, hops=4
 - **D2H (Device → Host Read)**: `MemoryReadMsg(src_sip=0, src_cube,
  src_pe=0)`. Total latency = forward command path + reverse data path.
  Same 4-hops category as H2D.
 - **PE DMA (PE-initiated)**: `PeDmaMsg(src_sip, src_cube, src_pe,
  dst_pa)`. Five cases cover varying cube/PE positions:
  - pe-local-hbm: same cube, same PE
  - pe-same-half-hbm: same cube, different PE (PE 1)
  - pe-cross-half-hbm: same cube, far PE (PE 4)
  - pe-cross-cube-hbm-best: adjacent cube (cube 1)
  - pe-cross-cube-hbm-worst: diagonal far cube (cube 15)
 The cube indices 4/8/12 (H2D) and 1/4/15 (PE DMA) are meaningful for a
 4 × 4 cube mesh (`sip.cube_mesh.w=4, h=4`); changes to the mesh size
 require these to be updated in lockstep.
 ### D2. Standard measurement size — `nbytes = 32768` (32 KiB)
 Every case in the summary table runs once with `nbytes=32768`. 32 KiB
 was chosen because:
 - DMA overhead and BW drain are balanced — neither dominates.
 - It compares cleanly against the one-shot transfer size of several
  sub-units (TCM, register file).
 Per-size utilization variations are shown in a separate sweep table
 (D3).
 ### D3. Utilization sweep — `[4 KiB, 16 KiB, 64 KiB, 256 KiB, 1 MiB]`
 `SWEEP_SIZES = [4096, 16384, 65536, 262144, 1048576]`,
 `SWEEP_LABELS = ["4KB", "16KB", "64KB", "256KB", "1MB"]`. Per size:
 ```
 drain   = nbytes / bottleneck_bw
 total   = overhead + wire + drain
 eff_bw  = nbytes / total
 util%   = eff_bw / bottleneck_bw × 100
 ```
 When `bn_bw is None or <= 0`, the column shows 0.0 %. The intent: the
 table shows in one view how small transfers become overhead-bound and
 large transfers become drain-bound as hop count rises.
 ### D4. Measured columns — actual / formula / breakdown
 Per-case columns:
 - `Actual` (total_ns): the SimPy run's `trace["total_ns"]`.
 - `Ovhd`: sum of `node.attrs["overhead_ns"]` along the path (formula).
 - `Drain`: `nbytes / min(edge.bw_gbs over path)` (formula).
 - `Wire`: `Σ edge.distance_mm * (ns_per_mm from spec)`.
 - `Ovhd%` / `Drain%`: each portion as a percentage of Actual. Wire is
  usually too small to display.
 - `Eff.BW`: `nbytes / total_ns` (measured BW).
 - `BN.BW`: bottleneck bandwidth (formula). The minimum edge BW along
  the path. Missing edge BW shows "-".
 - `Util%`: `Eff.BW / BN.BW × 100`. 100 % means the single-stream BW
  upper bound is reached.
 A large gap between the formula sum (`wire + ovhd + drain`) and Actual
 signals a factor the simplified model misses (a place to inspect
 ADR-0033's assumptions).
 ### D5. Automatic invariant checks — PASS/FAIL
 The following invariants are reported with `[v] PASS` / `[x] FAIL`:
 - **H2D / D2H monotonic increase**: as hop count rises, actual latency
  must grow monotonically. `all(lats[i] < lats[i+1] for ...)`.
 - **D2H ≥ H2D**: for the same hop index, D2H ≥ H2D (D2H has both
  forward command and reverse data legs). `all(d2h[i].total >=
  h2d[i].total)`.
 - **PE DMA best < worst**: cross-cube best (adjacent) latency must be
  less than cross-cube worst (diagonal).
 - **PE DMA local vs remote**: prints the local BN BW vs remote BN BW
  side-by-side (informational, not PASS/FAIL).
 When a check fails, a single clear line surfaces the regression for
 human review.
 ### D6. Route detail — per-hop timestamp trace
 After the summary and sweep tables, each case's path and cumulative
 per-hop timestamps (`_hop_timestamps`) appear in a separate section:
 - H2D: leg1 (`pcie_ep → io_cpu`) + leg2 (`io_cpu → m_cpu`) + leg3
  (`m_cpu → hbm_ctrl`) + per-hop trace.
 - D2H: forward (cmd, no data) and reverse (data) traces shown
  separately.
 - PE DMA: `pe_dma → router → hbm_ctrl` path + per-hop trace.
 Each hop's timestamp is cumulative `wire_ns + overhead_ns`. The
 terminal hop's annotation appends `drain:Xns`. Bottleneck edges are
 marked `<BN:XXGB/s>` so they are visually identifiable.
 ### D7. Semantics of the `case_filter` argument
 - `None` or `"all"`: run all cases (default).
 - Other strings: run only the case whose name matches exactly. Example:
  `kernbench probe --case h2d-2hop`.
 Within a category, cases with `name != case_filter` are skipped; if
 only one data point remains, the category's monotonicity / D2H ≥ H2D
 comparisons are naturally skipped.
 The CLI parser's `--case` default is `"all"`, so omitting it runs
 everything.
 ### D8. Fresh GraphEngine per case
 Each of the 4 H2D, 4 D2H, and 5 PE DMA cases runs in **its own
 GraphEngine** (`engine = GraphEngine(graph)`). Reasons:
 - Isolate accumulated state (op_log, completion tracking, allocators)
  so cases do not cross-talk.
 - Guarantee one case's traffic does not perturb another case's BW
  measurement.
 This isolation lets probe results be interpreted as **single-flow**
 per-case latency. Multi-flow contention measurement is handled by
 separate tooling (e.g., the `pe2pe_overview` plot or ADR-0033's
 multi-flow merging model).
 ### D9. Output-format stability
 probe's stdout is meant for humans; precise column widths, separators,
 and whitespace are **not** a machine-readable contract. Automated tools
 that wish to parse probe output should use a separate JSON-output mode
 (not yet implemented).
 The `[v]` / `[x]` prefix on PASS/FAIL lines is a stable CI grep anchor.
 ## Alternatives Considered
 ### A1. Register probe as another bench (`@bench(name="probe")`)
 Rejected. probe is a verification tool, not a bench — multi-engine
 execution for sweeps/analysis and PASS/FAIL invariant output are
 essential, none of which fits ADR-0045's "single device + single
 RuntimeContext" bench model.
 ### A2. Exit code 1 on monotonicity violation
 Rejected (currently). probe is positioned as a human inspection tool —
 PASS/FAIL is printed and exit is 0. A wrapper can `grep "\[x\]"` to
 decide. A future `--strict` flag could opt into non-zero exits.
 ### A3. Externalize the case catalog to YAML
 Rejected (currently). The 8 cases (4 H2D + 4 D2H + 5 PE DMA = 13 total)
 are hardcoded and their semantics are tightly bound to the mesh
 topology. Moving cube-index meaning (4, 8, 12 / 1, 4, 15) into YAML
 would require separate documentation and lose cohesion. Externalize
 only when case additions become frequent.
 ### A4. Add multi-flow contention measurement
 Rejected (out of probe scope). D8's single-flow isolation is probe's
 core intent. Multi-flow contention belongs in a different area of the
 ADR-0033 latency model — either a separate tool or a new case
 category.
 ## Consequences
 - probe's case catalog (D1) and measurement units (D2/D3) are pinned at
  ADR level, so new traffic categories know which table format to
  follow.
 - The semantics of the formula-vs-actual columns (D4) are locked in, so
  questions like "why is Drain% 5 % or 70 %?" can quickly be linked to
  ADR-0033 assumption checks.
 - Automatic invariant checks (D5) are pinned, so latency-model changes
  immediately catch monotonicity / D2H ≥ H2D regressions.
 - D8's case-isolation is explicit, so probe results are safe to read as
  single-flow measurements. If multi-flow is needed, a separate tool
  track is clearly required.
 - A2's strict-mode flag is recorded as a follow-up so CI integration
  has a minimal change path when requested.